Parallel Genetic Algorithms for Hypercube Machines


Ranieri Baraglia and Raffaele Perego
Abstract. In this paper we investigate the design of highly parallel Genetic
Algorithms. The Traveling Salesman Problem is used as a case study to evaluate
and compare different implementations. To tune the various parameters of
Genetic Algorithms to the case study considered, the Holland sequential Genetic
Algorithm, with different population replacement methods and crossover
operators, has been implemented and tested. Both fine-grained and
coarse-grained parallel GAs which adopt the selected genetic operators have
been designed and implemented on a 128-node nCUBE 2 multicomputer. The
fine-grained algorithm uses an innovative mapping strategy that makes the
number of solutions managed independent of the number of processing nodes used.
Complete performance results showing the behavior of Parallel Genetic
Algorithms for different population sizes, numbers of processors used, and
migration strategies are reported.


1  Introduction 

Genetic Algorithms (GAs) [11, 12] are stochastic optimization heuristics in
which searches in the solution space are carried out by imitating the
population genetics stated in Darwin's theory of evolution. Selection,
crossover and mutation operators, directly derived from natural evolution
mechanisms, are applied to a population of solutions, thus favoring the birth
and survival of the best solutions. GAs have been successfully applied to many
NP-hard combinatorial optimization problems [6], in several application fields
such as business, engineering, and science.

In order to apply GAs to a problem, a genetic representation of each
individual (chromosome) that constitutes a solution of the problem has to be
found. Then, we need to create an initial population, to define a cost function
to measure the fitness of each solution, and to design the genetic operators
that will allow us to produce a new population of solutions from a previous
one. By iteratively applying the genetic operators to the current population,
the fitness of the best individuals in the population converges to local
optima.

Figure 1 reports the pseudo-code of the Holland genetic algorithm. After
randomly generating the initial population $\beta(0)$, the algorithm at each
iteration of the outer repeat-until loop generates a new population
$\beta(t+1)$ from $\beta(t)$ by selecting the best individuals of $\beta(t)$
(function SELECT()) and probabilistically applying the crossover and mutation
genetic operators. The selection mechanism must ensure that the greater the
fitness of an individual $A_k$ is, the higher the probability of $A_k$ being
selected for reproduction. Once $A_k$ has been selected, $P_c$ is its
probability of generating a son by applying the crossover operator to $A_k$ and
another individual $A_i$, while $P_m$ and $P_j$ are the probabilities of
applying the mutation and inversion operators, respectively, to the generated
individual.

The  crossover  operator  randomly  selects  parts  of  the  parents’  chromosomes 
and  combines  them  to  breed  a  new  individual.  The  mutation  operator  randomly 
changes  the  value  of  a  gene  (a  single  bit  if  the  binary  representation  scheme  is 
used)  within  the  chromosome  of  the  individual  to  which  it  is  applied.  It  is  used 
to  change  the  current  solutions  in  order  to  avoid  the  convergence  of  the  solutions 
to  “bad”  local  optima. 

The new individual is then inserted into population $\beta(t+1)$. Two main
replacement methods can be used for this purpose. In the discrete population
model, the whole population $\beta(t)$ is replaced by newly generated
individuals at the end of the outer loop iteration. A variation on this model
was proposed in [13], using a parameter that controls the percentage of the
population replaced at each generation. In the continuous population model, on
the other hand, new individuals are inserted immediately into the current
population to replace older individuals with worse fitness. This replacement
method allows potentially good individuals to be exploited as soon as they
become available.

Irrespective of the replacement policy adopted, population $\beta(t+1)$ is
expected to contain a greater number of individuals with good fitness than
population $\beta(t)$. The GA can end either when a maximum number of generated
populations is reached, after which the algorithm is forced to stop, or when
the algorithm converges to stable average fitness values.

The following are some important properties of GAs:

-  they do not deal directly with problem solutions but with their genetic
representation, thus making GA implementation independent from the problem
in question;

-  they  do  not  treat  individuals  but  rather  populations,  thus  increasing  the 
probability  of  finding  good  solutions; 

-  they  use  probabilistic  methods  to  generate  new  populations  of  solutions,  thus 
avoiding  being  trapped  in  “bad”  local  optima. 

On the other hand, GAs do not guarantee that global optima will be reached,
and their effectiveness depends heavily on many parameters whose setting may
depend on the problem considered. The size of the population is particularly
important. The larger the population is, the greater the possibility of
reaching the optimal solution. Increasing the population clearly results in a
large increase in GA computational cost which, as we will see later, can be
mitigated by exploiting parallelism.

The rest of the paper is organized as follows: Section 2 briefly describes the
computational models proposed to design parallel GAs; Section 3 introduces the
Traveling Salesman Problem used as our case study, discusses the implementation
issues and presents the results achieved on a 128-node hypercube multicomputer;
finally Section 4 outlines the conclusions.


Program Holland_Genetic_Algorithm;
begin
    t = 0;
    β(t) = INITIAL_POPULATION();
    repeat
        for i = 1 to number_of_individuals do
            F(A_i) = COMPUTE_FITNESS(A_i);
        Average_fitness = COMPUTE_AVERAGE_FITNESS(F);
        for k = 1 to number_of_individuals do
        begin
            A_k = SELECT(β(t));
            if (P_c > random(0,1)) then
            begin
                A_i = SELECT(β(t));
                A_child = CROSSOVER(A_i, A_k);
                if (P_m > random(0,1)) then MUTATION(A_child);
                β(t+1) = UPDATE_POPULATION(A_child);
            end
        end;
        t = t + 1;
    until (end_condition);
end


Figure 1. Pseudo-code of the Holland Genetic Algorithm.
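The loop of Figure 1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the fitness-proportionate `select` helper, and the choice of letting children displace the worst current individuals in UPDATE_POPULATION are all assumptions, and fitness is taken to be non-negative with larger values being better.

```python
import random


def select(population, scores, rng):
    """Fitness-proportionate choice (assumed rule; scores must be non-negative)."""
    if sum(scores) == 0:
        return rng.choice(population)
    return rng.choices(population, weights=scores, k=1)[0]


def holland_ga(population, fitness, crossover, mutate, p_c, p_m, generations, rng=None):
    """Run the repeat-until loop of Figure 1 for a fixed number of generations."""
    rng = rng or random.Random()
    population = list(population)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        children = []
        for _ in range(len(population)):
            a_k = select(population, scores, rng)
            if p_c > rng.random():
                a_i = select(population, scores, rng)
                child = crossover(a_i, a_k, rng)
                if p_m > rng.random():
                    child = mutate(child, rng)
                children.append(child)
        # Assumed UPDATE_POPULATION: children displace the worst individuals.
        ranked = sorted(population, key=fitness, reverse=True)
        population = ranked[: len(population) - len(children)] + children
    return population
```

The caller supplies `crossover(a, b, rng)` and `mutate(x, rng)`, so the same loop can be reused with any chromosome representation.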

2  Parallel  Genetic  Algorithms 

The  availability  of  ever  faster  parallel  computers  means  that  parallel  GAs  can 
be  exploited  to  reduce  execution  times  and  improve  the  quality  of  the  solutions 
reached  by  increasing  the  sizes  of  populations  managed. 

In  [5, 3]  the  parallelization  models  adopted  to  implement  GAs  are  classified. 
The  models  described  are: 

-  centralized model. A single unstructured panmictic population is processed
in parallel. A master processor manages the population and the selection
strategy, and requests a set of slave processors to compute the fitness
function and apply other genetic operators to the chosen individuals. The
model scales poorly and explores the solution space like a sequential
algorithm which uses the same genetic operators. Several implementations of
centralized parallel GAs are described in [1].

-  fine-grained model. This model operates on a single structured population
by exploiting the concepts of spatiality and neighborhood. The first concept
specifies that a very small sub-population, ideally just one individual, is
stored in one element (node) of the logical connection topology used, while
the second specifies that the selection and crossover operators are applied
only between individuals located on nearest-neighbor nodes. The neighbors of
an individual determine all its possible partners, but since the neighbor
sets of partner nodes overlap, good solutions can still spread across the
entire population. Because of its scalable communication pattern, this model
is particularly suited for massively parallel implementations.
Implementations of fine-grained parallel GAs applied to different application
problems can be found in [8, 9, 14, 15, 19].

-  coarse-grained model. The whole population is partitioned into
sub-populations, called islands, which evolve in parallel. Each island is
assigned to a different processor and the evolution process takes place only
among individuals belonging to the same island. As a result, greater genetic
diversity can be maintained than with a panmictic population, thus improving
the exploration of the solution space. Moreover, in order to improve the
sub-population genotypes, a migration operator that periodically exchanges
the best solutions among different islands is provided. Depending on the
migration operator chosen we can distinguish between island and stepping
stone implementations. In island implementations migration occurs among all
islands, while in stepping stone implementations migration occurs only
between neighboring islands. Studies have shown that there are two critical
factors [10]: the number of solutions migrated each time and the time
interval between two consecutive migrations. With a large number of migrants
the island model behaves much like a panmictic model. With too few migrants
the GA cannot mix the genotypes, which reduces the chance of escaping the
local optima inside the islands. Implementations of coarse-grained parallel
GAs can be found in [10, 20, 21, 4, 18, 16].

3  Designing  parallel  GAs 

We  implemented  both  fine-grained  and  coarse-grained  parallel  GAs  applied 
to  the  classic  Traveling  Salesman  Problem  on  a  128-node  nCUBE  2  hypercube. 
Their  performance  was  measured  by  varying  the  type  and  value  of  some  genetic 
operators.  In  the  following  subsection  the  TSP  case  study  is  described  and  the 
parallel  GA  implementations  are  discussed  and  evaluated. 

3.1  The  Traveling  Salesman  Problem 

The Traveling Salesman Problem (TSP) may be formally defined as follows: let
$C = \{c_1, c_2, \ldots, c_n\}$ be a set of $n$ cities and, $\forall i, \forall j$,
$d(c_i, c_j)$ the distance between city $c_i$ and $c_j$, with
$d(c_i, c_j) = d(c_j, c_i)$. Solving the TSP entails finding a permutation
$\pi'$ of the cities $(c_{\pi'(1)}, c_{\pi'(2)}, \ldots, c_{\pi'(n)})$ such that

$$\sum_{i=1}^{n} d(c_{\pi'(i)}, c_{\pi'(i+1)}) \le \sum_{i=1}^{n} d(c_{\pi(i)}, c_{\pi(i+1)}) \quad \forall \pi \ne \pi', \; (n+1) \equiv 1 \qquad (1)$$


According to the TSP path representation described in [9], tours are
represented by ordered sequences of integer numbers of length $n$, where the
sequence $(\pi(1), \pi(2), \ldots, \pi(n))$ represents a tour joining, in
order, cities $c_{\pi(1)}, c_{\pi(2)}, \ldots, c_{\pi(n)}$. The search space
for the TSP is therefore the set of all permutations of $n$ cities. The optimal
solution is a permutation which yields the minimum cost of the tour.
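As a small illustration of the path representation, the cost of a tour is the sum of the distances between consecutive cities, with the last city linked back to the first. The sketch below assumes cities are numbered from 0 and distances are given as a symmetric matrix; the 4-city instance is hypothetical and used only for illustration.

```python
def tour_cost(tour, dist):
    """Sum d(c_pi(i), c_pi(i+1)) over the closed tour; index n+1 wraps to 1."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


# Hypothetical symmetric 4-city distance matrix (not taken from the paper).
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(tour_cost([0, 1, 3, 2], dist))
print(tour_cost([0, 1, 2, 3], dist))
```

Since a GA as described above favors individuals with greater fitness, a cost of this kind would typically be negated or inverted before being used as a fitness value.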

The TSP instances used in the tests are: GR48, a 48-city problem that has
an optimal solution equal to 5046, and LIN105, a 105-city problem that has an
optimal solution equal to 14379 (see footnote 1).


3.2  Fixing  the  genetic  operators 

In order to study the sensitivity of the GAs for the TSP to the setting of the
genetic operators, we used Holland's sequential GA by adopting the discrete
generation model, one and two point crossover operators, and three different
population replacement criteria.


The discrete generation model separates the sons' population from the parents'
population. Once the whole sons' population has been generated, it is merged
with the parents' population according to the replacement criterion adopted.

One point crossover breaks the parents' tours into two parts and recombines
them in the son in a way that ensures tour legality [2]. Two point crossover
[7] works like the one point version but breaks the parents' tours into three
different parts. A mutation operator which simply exchanges the order of
two cities of the tour has also been implemented and used [9].
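The two operators can be sketched as follows. The text does not spell out how tour legality is ensured in [2], so the repair rule used here (keep the head of the first parent, then append the missing cities in the order they appear in the second parent) is an assumption made for illustration, as are the function names.

```python
import random


def one_point_crossover(parent_a, parent_b, rng=random):
    """Cut parent_a once and complete the child with parent_b's remaining cities."""
    cut = rng.randrange(1, len(parent_a))
    head = list(parent_a[:cut])
    used = set(head)
    # Assumed legality rule: missing cities are taken in parent_b's order.
    tail = [city for city in parent_b if city not in used]
    return head + tail


def swap_mutation(tour, rng=random):
    """Exchange the positions of two distinct cities of the tour."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = list(tour)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Both functions return new lists and leave the parents untouched, so the same parent can be selected again within a generation.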

The replacement criterion specifies a rule for merging the current and new
populations. We tested three different replacement criteria, called R1, R2 and
R3. R1 replaces the solutions of the current population that have the lowest
fitness with all the son solutions, regardless of the sons' fitness. R2 orders
the sons by fitness, and replaces an individual i of the current population
with son j only if the fitness of i is lower than the fitness of j. R2 has
tighter control over the population than R1, and allows only the best sons to
enter the new population. R3 selects the parents with lower than average
fitness, and replaces them with the sons with above average fitness.
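A minimal sketch of the three criteria is given below, under several assumptions that the text leaves open: `fitness` is a callable where larger values are better, R2 pairs the best son with the worst current individual, and R3 uses the average fitness of the current population as the threshold for both parents and sons.

```python
def replace_r1(population, children, fitness):
    """R1: sons displace the worst current individuals unconditionally."""
    children = list(children)[: len(population)]
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: len(population) - len(children)] + children


def replace_r2(population, children, fitness):
    """R2: a son enters only if it is fitter than the individual it displaces."""
    result = sorted(population, key=fitness)               # worst first
    ranked = sorted(children, key=fitness, reverse=True)   # best son first
    for slot, child in zip(range(len(result)), ranked):
        if fitness(result[slot]) < fitness(child):
            result[slot] = child
    return result


def replace_r3(population, children, fitness):
    """R3: below-average parents are replaced by above-average sons."""
    average = sum(fitness(ind) for ind in population) / len(population)
    good_children = [c for c in children if fitness(c) > average]
    result = list(population)
    weak_slots = [i for i, ind in enumerate(result) if fitness(ind) < average]
    for slot, child in zip(weak_slots, good_children):
        result[slot] = child
    return result
```

All three functions keep the population size constant, which matches the fixed-size populations used in the tests.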


The tests for setting the genetic operators were carried out by using a 640
solution population, a 0.2 mutation parameter (to apply a mutation to 20% of
the total population), 2000 generations for the 48-city TSP, and 3000
generations for the 105-city TSP. Every test was run 32 times, starting from
different random populations, to obtain an average behavior. From the results
of the 32 tests we computed:

1  Both the TSP instances are available at:
ftp://elib.zib-berlin.de/pub/mp-testdata/tsp/tsplib.html


-  the average solution: $AVG = \frac{1}{32} \sum_{i=1}^{32} F_{E_i}$, where
$F_{E_i}$ is the best fitness obtained with run $E_i$;

-  the best solution: $BST = \min\{F_{E_i}, i = 1, \ldots, 32\}$;

-  the worst solution: $WST = \max\{F_{E_i}, i = 1, \ldots, 32\}$.
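These three statistics are straightforward to compute from the best fitness of each run. The sketch below assumes the per-run results are collected in a list; because fitness here is a tour cost, the best solution is the minimum and the worst is the maximum.

```python
def summarize_runs(best_per_run):
    """Return (AVG, BST, WST) over the best tour cost of each run."""
    avg = sum(best_per_run) / len(best_per_run)
    return avg, min(best_per_run), max(best_per_run)
```

For example, `summarize_runs([5300, 5100, 5500])` gives an average of 5300, a best value of 5100 and a worst value of 5500 (illustrative numbers, not results from the paper).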

These preliminary tests allow us to choose some of the most suitable genetic
operators and parameters for the TSP. Figure 2 plots the average fitness
obtained by varying the crossover type on the 48-city TSP problem. The
crossover was applied to 40% of the population and the R2 replacement criterion
was used. As can be seen, the two point crossover converges to better average
solutions than the one point operator. The one point crossover initially
exhibits a better behavior but, after 2000 generations, converges to solutions
that have considerably higher costs. We obtained a similar behavior for the
other replacement criteria and for the 105-city TSP.

Table 1 reports AVG, BST and WST results for the 48-city TSP obtained
by varying both the population replacement criterion and the percentage of
the population to which the two point crossover has been applied. On average,
crossover parameter values in the range 0.4 to 0.6 lead to better solutions,
almost irrespective of the replacement criterion adopted. Figure 3 shows the
behavior of the various replacement criteria for a 0.4 crossover value. The R2
and R3 replacement criteria resulted in a faster convergence than R1, and they
converged to very similar fitness values.


Figure 2. Fitness values obtained with the execution of the sequential GA on the
48-city TSP by varying the crossover operator, and by using the R2 replacement
criterion.

Criterion  Metric  Values by crossover parameter (column order)
R1         AVG     5632  5585  5870         (row incomplete)
R1         BST     5315  5135  5305         (row incomplete)
R1         WST     7828  6079  6231  6693
R2         AVG     5902  5696  5735  5743
R2         BST     5405  5243  5323  5410
R2         WST     7122  6180  6225  6243
R3         AVG     6251  5669  5722  5773
R3         BST     5441  5178  5281  5200
R3         WST     7354  6140  6594         (row incomplete)

Table 1. Fitness values obtained with the execution of the sequential GA on the
48-city TSP by varying the value of the crossover parameter and the population
replacement criterion.


Figure 3. Fitness values obtained by executing the sequential GA on the 48-city
TSP with a 0.4 crossover parameter, and by varying the replacement criterion.


3.3  The  coarse  grained  implementation 

The coarse grained parallel GA was designed according to the discrete
generation and stepping stone models. Therefore, the new solutions are merged
with the current population at the end of each generation phase, and the
migration of the best individuals among sub-populations is performed among
ring-connected islands. Each of the P processors manages N/P individuals, where
N is the population size (640 individuals in our case). The number of migrants
is a fixed percentage of the sub-population. As in [4], migration occurs
periodically, after a fixed number of generations.

The R1 replacement criterion was used to include all the migrants in the
current sub-populations, and the R2 criterion to merge the sub-population with
the locally generated solutions. Moreover, the two point crossover operator was
adopted.
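The migration step on a ring of islands can be sketched as below. The sketch assumes a unidirectional ring in which each island sends copies of its best individuals to its successor, that all islands have the same size, and that `fitness` is a callable where larger values are better; incoming migrants replace the worst local individuals unconditionally, in the spirit of R1. None of these details is confirmed by the text beyond the use of ring-connected islands and of R1 for migrants.

```python
def migrate_ring(islands, fitness, rate):
    """Send the best `rate` fraction of each island to the next island on the ring."""
    migrants = []
    for island in islands:
        count = max(1, int(rate * len(island)))
        migrants.append(sorted(island, key=fitness, reverse=True)[:count])

    new_islands = []
    for index, island in enumerate(islands):
        incoming = migrants[(index - 1) % len(islands)]
        keep = max(0, len(island) - len(incoming))
        survivors = sorted(island, key=fitness, reverse=True)[:keep]
        # All migrants are accepted, displacing the worst local individuals.
        new_islands.append(survivors + list(incoming))
    return new_islands
```

Calling this function every fixed number of generations reproduces the periodic migration scheme described above, with `rate` playing the role of the migration parameter.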

Table 2 reports some results obtained by running the coarse grained parallel
GA on the 48-city TSP. M denotes the migration parameter. The same data for
a migration parameter equal to 0.1 are plotted in Figure 4. It can be seen that
the AVG, BST and WST solutions get worse as the number of nodes used increases.
This is a consequence of the constant population size used: with 4 nodes,
sub-populations of 160 solutions are exploited, while with 64 nodes the
sub-populations consist of only 10 individuals. Decreasing the number of
solutions that form a sub-population worsens the search, because small
sub-populations result in an insufficient exploration of the solution space.
The influence of the number of migrants on the convergence is clear from
Table 2. When the sub-populations are small, a higher value of the migration
parameter may improve the quality of solutions through a better mix of the
genetic material.


                      Number of processing nodes
Migration  Metric      4      8     16     32     64    128
M=0.1      AVG       [?]   5786   5933   6080   6383   6995
M=0.1      BST      5438   5315   5521   5633   5880   6625
M=0.1      WST      6250   6387   6516   6648   8177   8175
M=0.3      AVG      5807   5877   5969   6039   6383   6623
M=0.3      BST      5194   5258   5467   5470   5727   6198
M=0.3      WST      6288   6644   7030   6540   8250   7915
M=0.5      AVG      5900    [?]   5870   6067   6329   6617
M=0.5      BST      5419   5475   5483   5372   6017   6108
M=0.5      WST      6335   6550   7029   6540   8250   7615

Table 2. Fitness values obtained with the execution of the coarse grained GA on
the 48-city TSP by varying the migration parameter.


3.4  The  fine  grained  implementation 

The fine grained parallel GA was designed according to the continuous
generation model, which is much better suited to fine grained parallel GAs than
the discrete one. The two point crossover operator was applied.

According to the fine grained model the population is structured in a logical
topology which fixes the rules of interaction between each solution and the
other solutions; each solution s is placed at a vertex v(s) of the logical
topology T. The crossover operation can only be applied between nearest
neighbor solutions, placed on vertices directly connected in T. Our
implementation exploits the physical topology of the target multicomputer;
therefore the population of $2^N$ individuals is structured as an
N-dimensional hypercube.
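In a hypercube whose vertices are labeled with binary numbers, two vertices are directly connected exactly when their labels differ in one bit. The nearest neighbors of a solution, i.e. its possible crossover partners, can therefore be enumerated by flipping each bit in turn. The function name below is an illustrative choice.

```python
def hypercube_neighbors(vertex, dimension):
    """Labels of the vertices adjacent to `vertex` in a `dimension`-cube."""
    return [vertex ^ (1 << bit) for bit in range(dimension)]


# Vertex 000 of a 3-dimensional hypercube is adjacent to 001, 010 and 100.
print([format(v, "03b") for v in hypercube_neighbors(0b000, 3)])
```

Each of the $2^N$ solutions thus has exactly N potential partners, independently of the population size.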

Figure 4. AVG, BST and WST values obtained by executing the coarse grained GA
on the 48-city TSP as a function of the number of nodes used, with 0.1 as the
migration parameter.


By exploiting the recursive definition of the hypercube topology, we made
the number of solutions treated independent of the number of nodes used to
execute the algorithm. As can be seen in Figure 5, a $2^3 = 8$ solution
population can be placed on a $2^2 = 4$ node hypercube, using a simple mapping
function which masks the first (or the last) bit of the Gray code used to
number the logical hypercube vertices [17]. Physical node X00 will hold
solutions 000 and 100 of the logical topology, without violating the
neighborhood relationships fixed by the population structure. In fact, the
neighbors of each solution s will still be, in the physical topology, on
directly connected nodes or on the node holding s itself.

We can generalize this mapping scheme: to determine the allocation of a $2^N$
solution population on a $2^M$ node physical hypercube, with M < N, we simply
mask the first (or the last) N - M bits of the binary coding of each solution.
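The mapping can be written as a single bit mask. The sketch below masks the first (most significant) N - M bits, which is one of the two options mentioned in the text; the function names are illustrative assumptions. The final loop checks the neighborhood property for the 8-solution, 4-node example: every logical neighbor of a solution ends up either on the same physical node or on a directly connected one.

```python
def physical_node(solution, n_dim, m_dim):
    """Map a solution label of an n_dim-cube to a node label of an m_dim-cube."""
    if not 0 <= m_dim <= n_dim:
        raise ValueError("need 0 <= m_dim <= n_dim")
    return solution & ((1 << m_dim) - 1)   # keep only the last m_dim bits


def neighbors(vertex, dimension):
    """Labels differing from `vertex` in exactly one bit."""
    return [vertex ^ (1 << bit) for bit in range(dimension)]


N, M = 3, 2
for s in range(1 << N):
    for t in neighbors(s, N):
        p, q = physical_node(s, N, M), physical_node(t, N, M)
        # Same node, or node labels differing in exactly one bit.
        assert p == q or bin(p ^ q).count("1") == 1
```

With N = 3 and M = 2, solutions 000 and 100 both map to node 00, matching the example of Figure 5.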


Figure 5. Example of an application of the mapping scheme (physical topology
with nodes X00, X10, X01 and X11).


Table 3 reports the fitness values obtained with the execution of the fine
grained GA by varying the population size from 128 solutions (a 7-dimensional
hypercube) to 1024 solutions (a 10-dimensional hypercube). As expected, the
ability of our mapping scheme to exploit population sizes larger than the
number of processors used leads to better quality solutions, especially when
few processing nodes are used. The improvement in the fitness values obtained
by increasing the number of nodes while keeping the population size fixed is
due to a particular feature of the implementation, which aims to minimize
communication times at the expense of the diversity of the solutions held on
the same node. Selection rules tend to choose partner solutions on the same
node. The consequence is a greater uniformity in the solutions obtained on few
nodes, which worsens the exploration of the solution space. The coarse grained
implementation suffered from the opposite problem, which resulted in worse
solutions as the number of nodes was increased. This behavior can be observed
by comparing Figure 4, concerning the coarse grained GA, and Figure 6,
concerning the fine grained algorithm with a 128 solution population applied
to the 48-city TSP.

Table 4 shows that an increase in the number of solutions processed results in
a corresponding increase in the speedup values. This is because a larger number
of individuals assigned to the same processor leads to lower communication
overheads for managing the interaction of each individual with neighbor
partners.


Number of processing nodes (columns): 1, 4, 8, 16, 32, 64, 128

Population  Metric  Values (column order)
128         AVG     39894  24361  23271  23963  22519  22567         (row incomplete)
128         BST     20774  19830  20532  20269  20593                (row incomplete)
128         WST     28256  25931                                     (row incomplete)
256         AVG     33375  25313  22146  21616  21695  22247  21187
256         BST     24059  19833  19759                              (row incomplete)
256         WST     26998  23973  24337  22256                       (row incomplete)
512         AVG     28422  23193  21553  20111  20364                (row incomplete)
512         BST     28987  22126  19336  20333  19093  18985  18917
512         WST     41020  25684  22807  22213  21696  21647         (row incomplete)
1024        AVG     20366  19370  18948  19152                       (row incomplete)
1024        BST     27010  21581  21480  18830  18256  18252  17446
1024        WST     40901  25307  22757  21714  20766  19525  20661

Table 3. Fitness values obtained with the execution of the fine grained GA
applied to the 105-city TSP after 3000 generations, by varying the number of
solutions per node.


Figure 6. AVG, BST and WST values obtained by executing the fine grained GA
applied to the 48-city TSP as a function of the number of nodes used. The
population size was fixed to 128 individuals.


Number    Speedup by number of individuals (column order)
of nodes
1         1      1      1                  (row incomplete)
4         3.92                             (row incomplete)
8         7.74                             (row incomplete)
16        15.12                            (row incomplete)
32        25.25  27.02                     (row incomplete)
64        44.84  49.57  53.29  56.13
128       98.06  105.32                    (row incomplete)

Table 4. Speedup of the fine grained GA applied to the 105-city TSP for
different population sizes.


3.5  Comparisons

We compared the fine and coarse grained algorithms on the basis of the
execution time required and the fitness values obtained by each one after the
evaluation of 512000 solutions. This comparison criterion was chosen because it
overcomes the differences between the computational models, which otherwise
make the fine and coarse grained algorithms hard to compare. The evaluation of
512000 new solutions allows both algorithms to converge and requires comparable
execution times. Table 5 shows the results of this comparison. It can be seen
that when the number of nodes used increases, the fine grained algorithm gets
appreciably better results than the coarse grained one. On the other hand, the
coarse grained algorithm shows a super-linear speedup due to the quicksort
algorithm used by each node for ordering by fitness the solutions managed. As
the number of nodes is increased, the number of individuals assigned to each
node decreases, thus requiring considerably less time to sort the
sub-population.
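The sorting argument can be illustrated with a rough calculation. Assuming that sorting n items costs about n log2 n comparisons (the average case of quicksort) and ignoring every other cost, splitting N = 640 individuals over P nodes reduces the per-node sorting work by more than a factor of P, which is one way a super-linear speedup can arise. The cost model and function name are assumptions made for illustration, not measurements from the paper.

```python
import math


def sort_speedup(population_size, nodes):
    """Ratio of n*log2(n) sorting work for the whole population vs one island."""
    local = population_size / nodes
    whole = population_size * math.log2(population_size)
    return whole / (local * math.log2(local))


for nodes in (4, 8, 16, 32, 64, 128):
    print(nodes, round(sort_speedup(640, nodes), 1))
```

The ratio grows faster than the number of nodes because both the number of items and the logarithmic factor shrink as the islands get smaller.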

                              Number of processing nodes
                               4      8     16     32     64    128
Fine grained
  AVG                      26373  26349  25922  25992  24613  23227
  BST                      23979  23906  23173  21992  22145  20669
  WST                      30140  27851  29237  29193  30176  26802
  Execution times            606    321     51   (row incomplete)
Coarse grained
  AVG                      23860  24526  27063  29422  32542  35342
  BST                      20219  21120  23510  25783  30927  33015
  WST                      25299  29348  36795  39330  39131  41508
  Execution times           1670    804    392    196     97     47

Table 5. Fitness values and execution times (in seconds) obtained by executing
the fine and coarse grained GA applied to the 105-city TSP with a population of
128 and 640 individuals, respectively.


4  Conclusions

We have discussed the results of the application of parallel GAs to the TSP.
In order to analyze the behavior of different replacement criteria and
crossover operators and values, Holland's sequential GA was implemented. The
tests showed that the two point crossover finds better solutions, as does a
replacement criterion which replaces an individual i of the current population
with son j only if the fitness of i is worse than the fitness of j. To
implement the fine-grained and coarse-grained parallel GAs on a hypercube
parallel computer the most suitable operators were adopted. For the coarse
grained GA we observed that the quality of solutions gets worse if the number
of nodes used is increased. Moreover, due to the sorting algorithm used to
order each sub-population by fitness, the speedup of the coarse grained GA was
super-linear. Our fine-grained algorithm adopts a mapping strategy that allows
the number of solutions to be independent of the number of nodes used. The
ability to exploit population sizes larger than the number of processors used
gives better quality solutions, especially when only a few processing nodes are
used. Moreover, the quality of solutions does not get worse if the number of
nodes used is increased. The fine grained algorithm showed good scalability. A
comparison between the fine and coarse grained algorithms highlighted that fine
grained algorithms represent the better compromise between the quality of the
solution reached and the execution time spent on finding it.

The GAs implemented reached only "good" solutions. In order to improve the
quality of solutions obtained, we are working to include a local search
procedure within the GA.

Abstract. The paper presents a parallel algorithm for generating conics on a
raster scan display device. Unlike Wright's algorithms [11] for drawing lines
and circles, which are based on geometric decompositions of the curves, the new
algorithm leads to an equal distribution of continuous sets of pixels between
the drawing processes. As a practical application, a parallel z-buffer
algorithm for quadric surfaces was introduced. The classical large set of
filled polygons was replaced by a set of conics representing some section
curves of the quadrics. Tests have been performed on a transputer machine and
on a distributed network*.


1  Introduction 

Parallel rendering offers the potential for high performance. Realizing this
potential requires a careful analysis of all the algorithms involved,
especially in the generation of life-like images, known as realistic image
synthesis, since the techniques falling in this category are notorious for
their high computational complexity. Such an analysis must start, first of all,
with the drawing algorithms of the basic graphics primitives. Three major
performance bottlenecks consistently resist attempts to increase rendering
speed: the number of floating-point operations to perform geometrical
calculations, the number of integer operations to compute pixel values, and the
number of frame-buffer accesses to store the image and to determine visible
surfaces. Several issues must be considered while designing a parallel system.
Some of the more important are load balancing, network congestion,
parallelization overheads, coherency exploitation, algorithm embedding, and
suitability to general-purpose parallel computers. Although parallelism has
been employed in computer graphics since the early days of the field, applying
it effectively is a complex problem.


1.1  Parallel  scan-conversion  algorithms  for  conics 

Scan-conversion algorithms use incremental methods to minimize the number of
calculations performed during each iteration. Bresenham’s classic algorithm [1]
for drawing lines and circles is attractive because it uses only integer
arithmetic. For the more general case of a conic, special incremental
algorithms, like those presented in [4] or [3], have been constructed using
Bresenham’s idea.

* This work was partially supported by IWR Institute, University of Heidelberg

Wright’s papers [11] and [10] report on methods of achieving parallelism in a
raster graphics system by parallelizing Bresenham’s algorithms for drawing lines
and circles. The parallel circle algorithm presented by Wright divides the work
among p processors by splitting the π/4 arc from north to north-east (octant 2;
the rest of the circle is constructed by symmetry) into p subarcs of
approximately equal length, and assigning a different processor to determine the
raster representation of each subarc. Wright remarked that the times for
different processors are not necessarily the same or similar, primarily because
the numbers of pixels treated by distinct processors are not equal. The same
remark applies when we draw a large set of circles using a pool-of-tasks
technique, a task being the drawing of one circle.
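
Wright’s remark can be checked with a small numerical sketch. It assumes (this is not stated in the source) that in the north-to-north-east octant the circle algorithm plots one pixel per unit step in x, so that the pixel count of a subarc is approximately its horizontal extent. The function name and the radius are illustrative only.

```python
import math


def pixels_per_equal_subarc(radius, p):
    """Approximate pixel count of each of p equal-length subarcs of octant 2.

    Assumption: one pixel is plotted per unit step in x, so a subarc's
    pixel count is the difference of the rounded x-coordinates of its ends.
    """
    # x-coordinate of the k-th division point, measured from north.
    bounds = [round(radius * math.sin(k * math.pi / (4 * p)))
              for k in range(p + 1)]
    return [bounds[k + 1] - bounds[k] for k in range(p)]


counts = pixels_per_equal_subarc(1000, 4)
print(counts)  # pixel counts from the subarc nearest north to the one nearest north-east
```

Under this assumption the subarc nearest to north receives the most pixels and the one nearest to north-east the fewest, which is the imbalance described above.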

Rather than dividing the circular arc into many equal subarcs, the algorithm
proposed recently by Huang and Banissi [6] segments the horizontal length of
octant 2 into many equal parts. Some arrangements have been made to include only
multiplication, addition and shift operations. Values of the sine and cosine
functions must be pretabulated outside the iteration loop.

1.2  Modeling  systems  based  on  quadrics 

Most surface-rendering systems render a set of polygons that approximate the
model representation. Polygons are simple, regular primitives that are
convenient to display, so that polygon rendering is usually more efficient and
numerically robust than direct surface rendering. Unfortunately, the polyhedral
model is only an approximation to the real surface, and aliasing and
inaccuracies frequently occur (especially for complex, curved surfaces). Other
representations, such as section-curve patches, are more convenient to specify
when modeling a curved surface. Rendering the surface as a set of curves is
appealing because the representation of each curve is exact (the need for the
anti-aliasing techniques developed for rendering polygonal approximations is
also eliminated).

Quadric surfaces are particularly useful in specialized applications such as
molecular modeling, and have also been integrated into geometric and solid
modeling systems, like the one presented in [7]. The natural quadrics (spheres,
circular cylinders and right cones) are the most natural objects for modeling
mechanical parts. The reasons for using quadrics include the ease of computing
the surface normal, testing whether a point is on the surface, computing the
depth of a point projected onto a plane (important in hidden-surface
algorithms), and calculating intersections of one surface with another [4].


1.3  Rendering  curved  surfaces 

Image synthesis involves three distinct phases [9]: preprocessing, which
consists of reading the data, transforming points, clipping, and projection;
rendering, which includes hidden-surface removal, shading, and any other visual
effects; and postprocessing, which involves displaying the image on a frame
buffer.

Volume rendering is a computationally intensive task. Current graphics
computers are not fast enough to generate high quality images at interactive
speeds. High resolution volume rendering can take minutes to hours on a
uniprocessor workstation, and the rendering time usually grows linearly with the
data size. Moreover, volume data sets can be very large, often too large for a
workstation to hold in memory at once.

Surface rendering is traditionally performed by approximating the surface with
polygons and then rendering the polygons. Almost all visible-surface algorithms
were described for objects defined by polygonal facets (one exception is the
z-buffer algorithm, which does not require that objects be polygons). Objects
such as curved surfaces must first be approximated by many small facets before
polygonal versions of any of these algorithms can be used. Although such an
approximation can be done, it is often preferable to scan convert curved
surfaces directly, eliminating polygonal artifacts and avoiding the extra
storage required by polygonal approximation. Special visible-surface algorithms
for quadrics have been developed (they all find the intersections of two
quadrics, yielding a complicated equation whose roots must be found numerically
[4]).

Recently several papers presented methods that render surfaces as sequences of
curves [2]. These methods have deficiencies in their ability to guarantee a
complete coverage of the rendered surface, in their ability to prevent
processing the same pixel multiple times, or in their ability to produce an
optimal surface coverage.

One of the most powerful attractions of the z-buffer algorithm is that it can be
used to render any object, provided a z-value can be determined for each point
in its projection. No explicit intersection algorithms need to be written.


1.4  Load  balancing 

The load balancing strategies associated with an image-space visible-surface
algorithm, like z-buffer, can be divided, with respect to how specific tasks are
determined, into data nonadaptive and data adaptive [9].

The data nonadaptive methodology relies on an initial decomposition of image
space unrelated to the input data. Many easily constructed image-space tasks of
varying work loads are assigned (statically or dynamically) for parallel
processing (dynamic assignment typically produces better load balancing than
static assignment). Typically the image is tiled into rectangular areas, and a
work-queue approach is used. When a processor starts rendering, it is assigned a
region and renders it. When it finishes the region, it asks for another region,
and it continues this loop until the image is done. The larger the number of
areas, the more work is involved in preprocessing, communication and object
duplication, but the better the load balancing.
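
The difference between static and dynamic (work-queue) assignment of tiles can be sketched with a toy simulation. The tile costs below are invented for illustration, and the two function names are hypothetical; the sketch only models rendering time per tile and ignores communication and preprocessing.

```python
import heapq


def static_makespan(costs, p):
    """Finish time when tile i is assigned in advance to processor i mod p."""
    loads = [0] * p
    for i, cost in enumerate(costs):
        loads[i % p] += cost
    return max(loads)


def dynamic_makespan(costs, p):
    """Finish time when each idle processor asks the work queue for the next tile."""
    finish = [0] * p          # time at which each processor becomes idle
    heapq.heapify(finish)
    for cost in costs:
        t = heapq.heappop(finish)      # the processor that becomes idle first
        heapq.heappush(finish, t + cost)
    return max(finish)


# Hypothetical tile costs: two expensive tiles among many cheap ones.
tile_costs = [9, 1, 1, 1, 9, 1, 1, 1]
print(static_makespan(tile_costs, 4), dynamic_makespan(tile_costs, 4))
```

In this example the static round-robin assignment gives both expensive tiles to the same processor, while the work queue spreads them over two processors.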

In the data adaptive case, the sizes of the tasks (the areas of the pixel
regions) are adjusted according to the input data in an attempt to obtain better
load balancing. Data adaptive schemes seem to be less suitable for generating
single images than data nonadaptive algorithms due to an additional
preprocessing overhead.


1.5  Goals 


In the next two sections we present a parallel algorithm for drawing a
two-dimensional curve by distributing the computational effort equally among a
moderate number of processors. In particular, we obtain a parallel algorithm
(described also in [8]) for scan converting general conics (ellipses, circles,
parabolas, hyperbolas) based on Van Aken’s sequential (incremental) algorithm
described in [4].

In the third section we present an application of this algorithm to the problem
of drawing a set of quadric surfaces. We have developed a parallel algorithm for
quadric rendering, which distributes the image building and combining processes.
Basically, the rendering is done using an algorithm of z-buffer type. Only
two-dimensional partial images are communicated among processors, not
three-dimensional volume data. The image combining operation is essentially the
composition operation applied to several two-dimensional images in parallel. The
algorithm guarantees a complete (dynamic) coverage of the rendered surface.

Load balancing is achieved by distributing equal parts of each scene object to
the available processors and rendering them independently of each other (a data
adaptive strategy).

The proposed algorithm was implemented in two different environments: on a
parallel machine, a T-800 multiprocessor, using the PARIX system for the C
language (a message-passing parallel architecture), and on a distributed network
of four identical SUN stations, using PVM 3.0 (Parallel Virtual Machine). The
goal of the tests is twofold: (i) to investigate to what extent the theoretical
parallelization can be realized in practice (on a parallel computer or on a
distributed system); (ii) to compare the proposed algorithm with similar
(sequential or parallel) algorithms in order to underline the efficiency of the
new algorithm and the advantage of the proposed load balancing strategy.


2  Computing  the  iteration  number  of  the  incremental 
algorithm 


Trying to generalize Wright’s idea to the case of a planar continuous curve, we
see that the mathematical problem of dividing it into equal parts is not very
simple. Moreover, mathematically equal parts of the continuous curve are not
represented on the display by the same number of pixels. Therefore, the number
of iterations of the basic incremental algorithm (Bresenham’s algorithm for
circles, for example) differs from one curve part to another.

Note that we can easily solve the problem of finding the extreme points of the
curve, especially when we treat a second degree curve.

We state that for a planar curve C generated by a function g that is twice
differentiable with continuous second derivatives, we can find the number of
pixels of the discretized curve which represents, at a 1:1 scale, the continuous
curve on a raster display.

Suppose we describe the curve C implicitly by a function $g(x, y)$ and we draw
the discretized curve with an incremental algorithm.

When the gradient vector has a slope $|m| > 1$ (in Figure 1(a), for example,
from $(x_1, y_1)$ to $(x_2, y_2)$, or from $(x_4, y_4)$ to $(x_6, y_6)$), the x
value is incremented by ±1 from one incremental step to the next. When the
gradient vector has a slope $|m| < 1$ (from $(x_2, y_2)$ to $(x_4, y_4)$, for
example), the y value is incremented by ±1 from one incremental step to the
next. If we establish a direction of plotting the curve $g(x, y) = 0$, for
example with $g(x, y) < 0$ on the left and $g(x, y) > 0$ on the right, the sign
of the increment can be properly chosen.


Fig.  1.  A  discretized  (scan-converted)  ellipse  (a)  Numbering  its  pixels  (b)  Dividing  it 
into  twenty  parts  with  equal  number  of  pixels 


In the conic case, Van Aken’s incremental algorithm [4] (dominated by integer
operations) can be used in the drawing process.

In order to find the number of pixels of the discretized curve, we must find the
points $(x, y)$ where the slope of the gradient is 0 (extreme pixels), 1 or -1
(where the incremented variable changes). These points can be found by solving

$$g(x, y) = 0, \qquad \frac{\partial g}{\partial x}(x, y) = 0 \ \text{ or } \ \frac{\partial g}{\partial y}(x, y) = 0 \ \text{ or } \ \frac{\partial g}{\partial x}(x, y) = \pm \frac{\partial g}{\partial y}(x, y)$$

in the unknown $(x, y)$.

In the case of a second degree curve, these equations are not complicated. There
are at most eight solutions (in Figure 1(a) these solutions are denoted by
$(x_i, y_i)$, $i = 1, \ldots, 8$). If the curve is a hyperbola or a parabola, we
must test which solution points are inside the screen domain, and we must
compute the intersections with the edges of the screen rectangle.

In the case of an arbitrary function g, solving the above equations can be a
difficult problem and, therefore, the computational overhead introduced by
dividing the curve can affect the execution time of a specific implementation of
the algorithm.

Using the above remarks we can compute the number of pixels of the discretized
curve (for example, in the case of Figure 1(a), the number of pixels is
$|x_2 - x_8| + |x_6 - x_4| + |y_4 - y_2| + |y_8 - y_6|$).
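
As a worked illustration of the pixel count, the following sketch treats the special case of an axis-aligned ellipse $x^2/a^2 + y^2/b^2 = 1$ centered at the origin. This special case, the rounding of the four diagonal points to pixels, and the numbering of the points (1 at the top, then clockwise, so that the even-numbered points are those where the gradient slope is ±1) are assumptions made here for illustration. For this ellipse the diagonal points are $(\pm a^2/r, \pm b^2/r)$ with $r = \sqrt{a^2 + b^2}$.

```python
import math


def ellipse_pixel_count(a, b):
    """Pixel count of the scan-converted ellipse x^2/a^2 + y^2/b^2 = 1.

    Assumed numbering: point 2 is upper right, 4 lower right,
    6 lower left, 8 upper left (points where the gradient slope is +-1).
    """
    r = math.hypot(a, b)
    xs, ys = round(a * a / r), round(b * b / r)
    p2, p4, p6, p8 = (xs, ys), (xs, -ys), (-xs, -ys), (-xs, ys)
    # x is stepped along the top (8 -> 2) and bottom (4 -> 6) arcs,
    # y is stepped along the right (2 -> 4) and left (6 -> 8) arcs.
    return (abs(p2[0] - p8[0]) + abs(p6[0] - p4[0])
            + abs(p4[1] - p2[1]) + abs(p8[1] - p6[1]))


print(ellipse_pixel_count(100, 100))  # close to 4 * sqrt(2) * 100 for a circle
```

Without rounding, the count for this special case simplifies to $4\sqrt{a^2 + b^2}$, so the cost of finding it is negligible compared with drawing the curve.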

3  Parallel  algorithm  for  drawing  a  conic 

The main difficulty for each processor involved in a parallel incremental
algorithm for drawing a basic graphic primitive is to jump into the middle of
the calculations that are normally performed sequentially by a single processor.

Suppose we know a point on the discretized curve (the initial point in the
sequential incremental algorithm), and we want to draw all the p-parts of a
curve simultaneously. The starting and ending pixels of a particular p-part of
the discrete curve are computed as follows. The starting pixel is the ending
pixel of the previous p-part of the curve, except for the first part, for which
we use the initial point as the starting pixel. We establish the number of
pixels of the current p-part of the curve (the same as for the previous p-part,
except for the last p-part). After that, we find the ending point: first, we get
an x or y value using this number and the number of pixels to be drawn in each
direction from the starting pixel; second, we get the other coordinate, y or x,
by solving the equation $g(x, y) = 0$. In Figure 1(b) the starting and ending
points are emphasized for the ellipse of Figure 1(a) and p = 20.
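
The computation of the starting and ending pixels can be sketched for the same special case of an axis-aligned ellipse centered at the origin. Everything specific here is an assumption for illustration: the function name, the choice of the upper-left diagonal point as the initial point, the clockwise traversal, and giving the remainder of the integer division to the last part. Because each coordinate is rounded independently, pixels at the joints between arcs may differ by one pixel from those of a true incremental algorithm.

```python
import math


def split_ellipse(a, b, p):
    """Start pixel, end pixel and pixel count of p parts of x^2/a^2 + y^2/b^2 = 1."""
    r = math.hypot(a, b)
    xs, ys = round(a * a / r), round(b * b / r)
    # (stepped variable, start value, step, length, sign of the solved coordinate)
    arcs = [("x", -xs, +1, 2 * xs, +1),   # top arc
            ("y", +ys, -1, 2 * ys, +1),   # right arc
            ("x", +xs, -1, 2 * xs, -1),   # bottom arc
            ("y", -ys, +1, 2 * ys, -1)]   # left arc
    total = sum(arc[3] for arc in arcs)

    def pixel_at(k):
        """Pixel reached after k incremental steps from the initial point."""
        k %= total                         # the curve is closed
        for var, start, step, length, sign in arcs:
            if k < length:
                v = start + step * k       # the stepped coordinate
                if var == "x":             # solve g(x, y) = 0 for y
                    return v, round(sign * b * math.sqrt(max(0.0, 1 - (v / a) ** 2)))
                return round(sign * a * math.sqrt(max(0.0, 1 - (v / b) ** 2))), v
            k -= length
        raise AssertionError("unreachable: k < total always falls in an arc")

    chunk = total // p
    parts = []
    for i in range(p):
        first = i * chunk
        last = total if i == p - 1 else first + chunk
        parts.append((pixel_at(first), pixel_at(last), last - first))
    return parts


for start, end, count in split_ellipse(300, 200, 20)[:3]:
    print(start, end, count)
```

Each processor would then run the sequential incremental algorithm from its own start pixel until it reaches its end pixel.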

After getting started, the logic for each processor is identical to the logic
used in the sequential (incremental) algorithm, except for a small change in the
stopping criterion (whether or not the ending point has been reached).

Figure 2 shows how curves such as conics or graphs of some functions can be
obtained using three drawing processes.

In practice, an implementation of the parallel algorithm on a distributed-memory
architecture will be efficient only if the creation of an image part takes more
time than the additional communication required to send the partial images and
the initial data between the processes. Therefore, we expect that the efficiency
of the parallel algorithm can be demonstrated when we deal with a large set of
curves.

Suppose we have a collection of quadrics (ellipsoids, spheres, cones, cylinders,
paraboloids or hyperboloids) which must be represented in a two-dimensional
image using a parallel or a perspective projection.

Fig. 2. Three processes working to produce an image of (a) an ellipse (b) a
parabola (c) a hyperbola (d) a sinusoidal function


Note that the intersection of a quadric with a plane is a conic. In the case of
a parallel projection, we can use the parallel algorithm to represent a
wire-frame model of a scene composed of such quadrics. An example of such a
scene is given in Figure 3. If the intersection plane is parallel to the
projection plane, the (parallel or perspective) projection of the intersection
curve is also a conic. We can exploit this observation in a z-buffer-like
algorithm.


Fig. 3. Wire-frame model of a scene composed of some quadrics: (a) the 5th part
of the image constructed by one processor (b) the final image


4 Rendering a quadric

Theoretically, we can generate a two-dimensional image of a quadric using the
intersection curves of the quadric with a set of planes parallel to the
projection plane.

The main problem is the distance between two consecutive intersection planes,
i.e. how to choose this distance so that no undrawn pixels appear between two
consecutive curves and the number of overlapping pixels is minimized. The
distance between consecutive planes varies with the shape of the quadric.

We start from a front-end plane. At a particular step, we can compute the
optimal distance using the differences between the extreme pixels of the current
intersection curve and those of the last intersection curve. Each difference
between corresponding pairs must be no more than one, and at least one such
difference must be nonzero. If this condition is not fulfilled, another
intersection plane must be selected (closer to or farther from the last one).

Each processor of the parallel/distributed system with p processors is
responsible for drawing the image of a p-part of each scene object; in effect,
it draws a p-part of each section curve. For any pixel of a section curve, a
z-value can be recovered from the equation of the intersection plane. The
extreme pixels of the intersection curves and the distances to the next section
planes must be found by each processor (a cost that is insignificant relative to
the total computation time when the number of processors is small).

Warn’s lighting model [4] was used in our implementation of the rendering
process.

Suppose that the input data for each processor is the set of coefficients of all
quadrics. Then the local memory of each processor needs only the capacity to
store a z-buffer, a local array of pixel colors, and the coefficients.

Theoretically, each z-buffer and each array of pixel colors must be sent to the
processor which displays the final image. In practice, it is more convenient to
send only the z-values and the colors of the pixels which have been drawn (for
example, the component pixels of the 3rd part of the sphere from Figure 4(a)).

Note that the proposed algorithm does not split the z-buffer among the
processors (unlike the parallel z-buffer algorithms presented in [5] or [9]).
Instead, each quadric is divided so that the number of pixels treated by each
processor is the same. Note also that the data storage requirements for each
processor are smaller than in the case of applying the z-buffer algorithm to a
polyhedral approximation (in particular, in the case of Figure 5, at least 14000
polygons are necessary to obtain a similar image).

We expect that the computational overhead introduced by dividing each object
into equal parts will have its maximum negative effect on the efficiency of the
parallel algorithm when the section curves are very small (in this case, the
incremental algorithm has a small number of iterations to distribute among the
available processors). In order to analyze such a case we have constructed the
model presented in Figure 6. As the tests will show, this computational overhead
takes no more than 15% of the time of the rendering process.

Fig. 4. Drawing a sphere: (a) the final image (b) the image produced by one
processor in the case of a group of p = 3 processors


Fig.  5.  Image  composed  of  natural  quadrics:  (a)  the  final  image  (b)  half-image  created 
by  one  processor 


Fig. 6. Geometric models of some well-known parametric surfaces using small
spheres: (a) the final image (b) half-image created by one processor


The resulting data from all the processors must be combined to form the final
image. Combining can be done algorithmically in several different ways. The
choice of one method of combining over another is determined by the size of the
data, the number of processors, the topology of the interconnection network and
the routing strategy used. In a centralized scheme all processors send their
local images to one processor. Although simple to implement, this strategy is
inefficient in the presence of a large number of processors. The reason lies in
the network congestion that arises when all processors attempt to send their
data to a single distinguished processor. Some comparative studies [12] of the
different methods of combining conclude that for a small number of processors, a
simple tree scheme is more suitable. Therefore, we have employed a strategy that
takes advantage of the inherent tree-like structure of composition (image
combining belongs to a class of associative operations which includes, for
example, addition). The final image can be created in $\log p$ steps.
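
A tree-structured combination of partial images can be sketched as follows. The sketch is sequential and only illustrates the order of the pairwise merges; the representation of a partial image as a dictionary of the drawn pixels, the convention that a smaller z is nearer to the viewer, and the function names are assumptions made here, not details taken from the implementation described in the paper.

```python
def composite(img_a, img_b):
    """Merge two partial images mapping pixel -> (z, color); the nearer z wins."""
    out = dict(img_a)
    for pixel, (z, color) in img_b.items():
        if pixel not in out or z < out[pixel][0]:
            out[pixel] = (z, color)
    return out


def tree_combine(images):
    """Combine a non-empty list of partial images pairwise, level by level.

    Returns the final image and the number of levels, which is
    ceil(log2(p)) for p partial images.
    """
    steps = 0
    while len(images) > 1:
        merged = [composite(images[i], images[i + 1])
                  for i in range(0, len(images) - 1, 2)]
        if len(images) % 2:            # an unpaired image moves up unchanged
            merged.append(images[-1])
        images = merged
        steps += 1
    return images[0], steps


partials = [{(0, 0): (5.0, "red")},
            {(0, 0): (2.0, "blue"), (1, 0): (7.0, "blue")},
            {(1, 0): (3.0, "green")},
            {}]
final, steps = tree_combine(partials)
print(final, steps)
```

Because the merge is associative, the merges within one level are independent of each other and could run on different processors at the same time.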


5  Experimental  results 

We denote by $T_0$ the running time (in seconds) of the sequential
implementation of the basic (Van Aken’s) incremental algorithm (with one process
on one processor), by $T_1$ the running time of the sequential implementation of
the concurrent algorithm (with one process on one processor), and by $T_p$ the
running time of the parallel or distributed implementation of the parallel
algorithm with p processes distributed to p processors. Then the efficiency of
the concurrent algorithm implemented in a sequential mode compared with the
classical incremental algorithm (the computational efficiency) is given by the
ratio $T_0/T_1$, the efficiency of the implementation of the concurrent
algorithm on p processors of a (simulated or real) parallel machine (the
implementation efficiency) is given by the ratio $T_1/(pT_p)$, and the
efficiency of the proposed algorithm using a (simulated or real) parallel
machine is given by the relationship $E_p = T_0/(pT_p)$ ($E_p^{Parix}$ on the
parallel machine, and $E_p^{Pvm}$ on the distributed network).
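
The three efficiency measures are related by a simple identity: the overall efficiency $T_0/(pT_p)$ is the product of the computational efficiency $T_0/T_1$ and the implementation efficiency $T_1/(pT_p)$. A minimal sketch follows; the function name and the timings are hypothetical and are not measurements from the paper.

```python
def efficiencies(t0, t1, tp, p):
    """Return (computational, implementation, overall) efficiency.

    t0: sequential basic incremental algorithm
    t1: concurrent algorithm run with one process on one processor
    tp: concurrent algorithm run with p processes on p processors
    """
    computational = t0 / t1
    implementation = t1 / (p * tp)
    overall = t0 / (p * tp)
    return computational, implementation, overall


# Hypothetical timings in seconds, for illustration only.
comp, impl, overall = efficiencies(t0=9.8, t1=10.0, tp=4.0, p=3)
print(f"{comp:.2%} {impl:.2%} {overall:.2%}")
```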

We have used a master-slave computing scheme. The master starts the slave
processes and, if necessary, sends them some initial data. Each slave finishes
its job and sends the result (through some intermediate processes, according to
a tree structure) back to the master, which displays the final image. The
algorithm running on each node processor consists of two main parts: the first
is an outer body containing the counterparts of the message-passing calls, and
the second is an inner loop for the rendering algorithm.

In order to design the communication procedures, we can adopt one of the
following strategies: (1) the master computes the section curves and the
associated depths, and sends the equation of each section curve and the starting
and ending points of the ith part of the curve to the ith processor, so that
each slave is occupied only with the rendering process; (2) the master computes
the equations of the section curves and broadcasts them with the associated
depths, and each slave must find the starting and ending points of the
corresponding part of the curve and render it; (3) the master broadcasts the set
of quadrics to be rendered, and each slave computes the section curves, finds
its part of each curve, and renders it; (4) each slave reads or generates the
set of quadrics to be rendered, generates the section curves, and renders the
corresponding parts.

Each of the above strategies has been tested. For our test images presented in
the previous section, the first two strategies were too expensive in time, since
the communication time exceeded the computing time (especially in the case of
the implementation on a workstation network). Therefore, the following results
refer to an implementation of the third strategy in the case of Figures 3, 4 and
5, and of the fourth strategy in the case of Figure 6 (in this particular case,
it is easier to generate the sequence of quadrics than to send each quadric
equation from the master to the slaves).

Experiment 1: Computational efficiency. We have studied the influence on the
running time of the computational overhead for splitting the curves into equal
parts. The efficiency of the concurrent algorithm implemented in a sequential
way (compared with the classical incremental algorithm) is almost 1 if the
discretized conics have a number of pixels comparable with the dimensions of the
final image (see Table 1). Note that the last column corresponds to the worst
case, when the section curves are very small (Figure 6). If we have a more
complicated curve (or set of curves) than a second degree one, the computational
efficiency decreases, since it is possible that the equations in the partial
derivatives are not linear and must be solved with a Newton-type procedure.

Table 1. Computational efficiency of the algorithm applied to the models from Figs. 1-6

No. curves    1      130    226    1307    15657    233993
Efficiency    98%    95%    94%    99%     99%      86%

Experiment 2: Efficiency of the parallel implementation. Comparing the running
time of the code for the algorithm on a single processor (without communication
overhead) with the running times on several processors, we have obtained the
results presented in Table 2(a). As can be seen, the parallel implementation is
effective when the code is used for drawing a large set of curves. Note that the
parallel implementation of the rendering algorithm is not useful when we have a
small number of curves (in the case of a single curve, the running time
increases with the number of processors). Plotting the results from Table 2(a)
(see Figure 7), we can draw the conclusion that, if the set of curves is large
(i.e. at least of the order of hundreds), the efficiency of the parallel
implementation decreases nearly linearly with the number of processors. The
final conclusion is that the efficiency of the proposed algorithm increases with
the number of curves and decreases with the number of processors.

Table 2. Parallel and distributed implementation efficiency: (a) efficiency of the parallel implementation for different values of p (b) $E_3$ for different computer architectures

No. curves   (a) one column per value of p                          (b) $E_3^{Pvm}$   $E_3^{Parix}$
1            29.79%  21.00%  14.00%  10.61%   8.45%   6.56%         12%               29%
130          78.87%  71.82%  64.93%  54.85%  48.25%  43.51%         51%               78%
226          85.29%  80.99%  75.99%  73.77%  67.81%  60.68%         67%               84%
401          86.43%  81.28%  76.57%  75.25%  69.52%  65.23%         68%               85%


Fig. 7. Algorithm implementation efficiency relative to the number of processors
and the number of curves (plotted series: 1, 130, 226 and 401 curves)


Experiment 3: Communication overhead in parallel and distributed
implementations. The running times and the efficiency of the code implemented on
the transputer machine were compared with those of the code implemented on the
distributed system. The results, for the case of 3 processors, are presented in
Table 2(b) (we have also obtained efficiency values, for the distributed system
with p = 2 processors, between 69% for Figure 6 and 88% for Figure 5 and, for
p = 3, between 53% and 82%). The differences are mainly due to the different
speeds of the communication procedures. The efficiency results on the
distributed system encourage the use of such systems in rendering problems.

Experiment 4: Quality of the image produced by the proposed rendering algorithm.
Different sets of quadrics have been considered, and the resulting images
and running times were compared with those produced by other rendering
techniques: ray-tracing, depth-sort and z-buffer algorithms for polyhedral
approximations. Owing to some shading techniques, the images produced with the
ray-tracing algorithm (with the POVRAY implementation, for example) are
more realistic than those produced with the proposed rendering algorithm, but
the running times favor the proposed algorithm. The depth-sort technique
is impractical when the approximate model has a large number of polygons
(as the model from Figure 6 requires, for example). In order to create an
image similar to that of the sphere from Figure 4, we used the (sequential)
z-buffer algorithm applied to a polyhedral model with 14400 vertices and 141161
polygons; although the discretization was very fine, one can distinguish the
contours of each polygon (the same remark can be made for the model from Figure
8). The running times also favor the new rendering algorithm.

Fig. 8. Partial image produced by one processor in an implementation of the classic
parallel z-buffer algorithm [5], applied to a polyhedral model associated with the set of
quadrics from Figure 6.(a), with a nonadaptive data strategy and a static assignment
of the processors (the final image, which has 206145 vertices, with 81 vertices per
sphere, and 196097 polygons, took four times longer to create than with the parallel
rendering algorithm, which treats 233993 section curves; the curve coefficients are
more convenient to store than the polygonal information). This half-image is produced
much faster than the other half due to a load imbalance between the two processors.

Abstract. Discrete spherical representations have been shown to be appropriate
for modeling and recognizing 3D objects. The SAI (Simplex Angle Image) is an
evolved spherical representation with several invariance properties, which
makes it possible to work with non-convex objects. This paper describes and
compares the efficiency of parallelization strategies for the SAI using both
a distributed (message-passing) and a shared memory approach. Several
experiments have been carried out on workstations and on a multiprocessor.
For each experiment, we have considered different sizes of files corresponding
to 3D scattered points representing objects, a variable degree of tessellation
of the unit Gaussian sphere, and a different number of working processes.


1  Introduction 

One of the main objectives of a vision system is to find a good way to represent
and model the acquired data. Nowadays, it is assumed that a good modeling
system should have at least the following properties ([1],[2],[3]):

-  Expressive  richness. 

-  Stability  (against  errors  or  presence  of  noise  in  the  input  data). 

-  Treatment  of  occlusions.

-  Physical  attributes  of  the  object’s  shape. 

-  Efficiency. 

Among the different approaches to modeling 3D objects from real data (for
example see [4], [5], [6], [7], [8]), a promising one is the model based on the
Gaussian sphere proposed by Ikeuchi et al. ([9]-[16]). This model involves the
definition of a homeomorphism from the input 3D cloud of points to a regular
triangular mesh obtained by subdividing an icosahedron of unit radius with
a previously fixed degree of resolution. The subdivision procedure constructs a
surface similar to the approximation of the surface of the unit sphere by
triangular facets. Each node of the surface of the sphere stores the details of
the local geometry of the input data at the corresponding node of the deformed
mesh. Ikeuchi's model stores the Simplex Angle Image (SAI). The representation
obtained by this procedure can be used in systems such as object prototyping
in CAD/CAM environments, sensorial perception equipment or object recognition
systems.

This paper proposes several parallelization algorithms for the computation
of the SAI using shared-memory and distributed-memory (MPI-based) approaches.
This work focuses on the experimental results and on their interpretation
in relation to the factors involved in the parallel algorithms. The rest of the
paper is organized as follows. Section 2 introduces the notation and outlines the
steps of the sequential algorithm. Descriptions and pseudocode corresponding
to the considered parallel strategies are given in Section 3. Experimental
results and a global comparison of the studied SAI algorithms are shown in
Section 4. Finally, Section 5 presents the conclusions and gives some remarks
on our future work.

2  Description  of  the  SAI  Algorithm

This  section  summarizes  the  notation  for  the  SAI  and  describes  the  main  steps 
in  the  sequential  algorithm. 


2.1  Notation 

The following terms are used in the rest of the paper:

SAI: Simplex Angle Image. The SAI angle represents the value of the curvature
between the edge formed by one node of the regular mesh and one of its
neighbors, and the plane formed by the node and the other two neighbors,
once the regular mesh is adjusted to the 3D input data. The corresponding
spherical representation, storing the SAI angle for each one of its nodes, is
called SAI. This representation is invariant under translation and scaling of
the original object.

Acronyms  for  the  SAI  phases  are: 

CRE: Creation of the regular mesh at the specified level of resolution.

CLP:  Computation  of  the  closest  input  point  for  each  node  of  the  regular  mesh. 

DEF:  Deformation  of  the  regular  mesh  in  order  to  adjust  the  shape  to  the  input 
3D  points. 

SAIA:  Computation  stage  of  the  SAI  angle  for  each  node  of  the  Gaussian 
sphere. 

n: Number of 3D points of the input data. In this work, we consider objects
with a maximum of 83,171 points.

RD:  Resolution  degree  of  the  regular  mesh.  In  our  experiments,  we  consider  a 
maximum  level  of  resolution  equal  to  5. 

m: Number of nodes of the regular mesh. This number depends on the RD
according to the following expression: $m = 20 \cdot 4^{RD}$, where RD = 0, 1, 2, ...

NP: Number of processes considered in the parallel implementations. In order
to have a uniform workload among processes, this number must divide 20
(the number of facets of the initial approximation corresponding to the
spherical mesh).
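
As a quick check of the two constraints above, the following minimal sketch evaluates the mesh size for each resolution degree and lists the process counts that divide the 20 initial facets. The function names `mesh_nodes` and `valid_process_counts` are illustrative and do not come from the paper.

```python
def mesh_nodes(rd: int) -> int:
    """Number of mesh nodes m = 20 * 4**RD for a resolution degree RD >= 0."""
    if rd < 0:
        raise ValueError("resolution degree must be non-negative")
    return 20 * 4 ** rd


def valid_process_counts(facets: int = 20) -> list[int]:
    """Process counts NP that divide the number of initial facets evenly."""
    return [count for count in range(1, facets + 1) if facets % count == 0]


if __name__ == "__main__":
    for rd in range(6):
        print(f"RD={rd}: m={mesh_nodes(rd)}")
    print("NP values dividing 20:", valid_process_counts())
```

At the maximum resolution degree considered in the experiments (RD = 5) the mesh has 20480 nodes.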

2.2  Sketch  of  the  Algorithm 

A  sketch  of  the  sequential  algorithm  for  the  definition  of  the  SAI  corresponding 
to  an  object,  given  by  a  file  with  3D  points  obtained  from  range  data,  follows. 

1. Construct the initial regular tessellated sphere which wraps around the set of
3D points defining the object. This involves three substeps: approximate the
sphere by a 20-face icosahedron; recursively tessellate each of its faces
into $4^{RD}$ small triangular faces (RD = 1, 2, ...); and define the
final tessellation by taking the dual of the previous triangulation, yielding a
geodesic dome with the same number of nodes.

2. Determine, for each node (face) of the geodesic dome, the closest 3D point
in the data file corresponding to the object being modeled. This is the
most time-consuming stage of the SAI algorithm, and its time complexity is
O(mn).

3. Perform an iterative deformation of the geodesic dome while the average sum
of local errors exceeds a fixed threshold. Every local error is defined by the
distance between a node of the sphere and its corresponding closest point,
as determined in the previous step. Each node is deformed to match the object,
according to an approximation force $F_o$ and a curvature force $F_g$, both
defined for the current position $P_t$ of each node at time $t$. The new position
of the node at time $t+1$ is given by: $P_{t+1} = P_t + F_o + F_g + d\,(P_t - P_{t-1})$,
where $d$ represents the damping coefficient affecting the rate of convergence.

4. Compute the discrete curvature (simplex angle) for each node of the deformed
mesh. Finally, map each simplex angle onto the corresponding node
of the regular mesh, obtained at the end of the first step. The resulting
structure is called Simplex Angle Image (SAI).
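
The closest-point stage of step 2 can be sketched as a brute-force search, which makes its O(mn) cost visible: every one of the m mesh nodes is compared against every one of the n data points. This is only an illustration under the assumption of plain Euclidean distance; the name `closest_points` is hypothetical and the paper does not give its actual implementation.

```python
import math

Point = tuple[float, float, float]


def closest_points(nodes: list[Point], data: list[Point]) -> list[int]:
    """For each mesh node, return the index of the nearest data point.

    Brute force: m nodes times n data points gives O(mn) distance tests.
    """
    result = []
    for node in nodes:
        # Scan all n data points and keep the one at minimum distance.
        best = min(range(len(data)), key=lambda j: math.dist(node, data[j]))
        result.append(best)
    return result


if __name__ == "__main__":
    nodes = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    data = [(0.9, 1.0, 1.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
    print(closest_points(nodes, data))
```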

By analyzing the stages of the SAI representation, we found that it can be
efficiently parallelized. Some remarks have to be taken into account: the
deformation is local in nature, the completion condition for the deformation
depends on a global error, and there is no node topology which could define
independent deformation regions on the mesh. These facts determine two implicit
synchronization phases, which bound the improvements achievable by the
parallelization.
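
To make the interplay between the local update and the global completion condition concrete, here is a minimal sequential sketch of the deformation loop of step 3. The update follows the damped rule given above, but the two forces are assumptions made purely for illustration: the approximation force is taken as `alpha` times the vector towards the closest data point, and the curvature force as `beta` times the vector towards the centroid of the three neighbors. The paper does not define the forces at this level of detail, and all names and default values here are hypothetical.

```python
import math


def average_error(nodes, targets):
    """Average distance between each node and its closest data point."""
    return sum(math.dist(p, q) for p, q in zip(nodes, targets)) / len(nodes)


def deform(nodes, targets, neighbors, alpha=0.3, beta=0.1, damping=0.2,
           threshold=1e-3, max_iter=1000):
    """Iterate P(t+1) = P(t) + F_o + F_g + d (P(t) - P(t-1)) on every node.

    targets[i] is the closest data point of node i and neighbors[i] lists
    the indices of its neighbor nodes. The loop stops on a global test.
    """
    prev, cur = list(nodes), list(nodes)
    iterations = 0
    while average_error(cur, targets) > threshold and iterations < max_iter:
        nxt = []
        for i, p in enumerate(cur):
            nbrs = [cur[j] for j in neighbors[i]]
            new_p = []
            for k in range(len(p)):
                centroid_k = sum(q[k] for q in nbrs) / len(nbrs)
                f_o = alpha * (targets[i][k] - p[k])     # assumed form of F_o
                f_g = beta * (centroid_k - p[k])         # assumed form of F_g
                inertia = damping * (p[k] - prev[i][k])  # d (P_t - P_{t-1})
                new_p.append(p[k] + f_o + f_g + inertia)
            nxt.append(tuple(new_p))
        prev, cur = cur, nxt
        iterations += 1
    return cur, iterations
```

Each node update reads only the node and its neighbors, whereas the loop condition needs the error of every node. In a parallel version these are the two points where processes must synchronize.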

3  Parallel  Strategies 

Several parallelization strategies have been considered for the SAI.
Implementations have been carried out using a message-passing interface
standard (MPI) and using shared memory primitives on several workstations and
on a multiprocessor with ten processors.

MPI represents the first attempt to standardize the communication library
for distributed-memory computing systems ([17], [18]). One of the major goals
of MPI is to provide a widely portable and easy-to-use programming library
without sacrificing performance. The parallel implementations of the SAI
using MPI that we considered follow. In all these strategies, an equivalent
data workload for all working processes (on the same or different machines)
is achieved.

Collective communication MPI: Collective communication involves a group of
processes at a time (point-to-point communication involves only two processes).
The information corresponding to complete nodes in the sphere is communicated
among processes which work with contiguous data-dependent portions
of the sphere. The following MPI primitives have been used in our
implementation: BROADCAST, ALLGATHER and REDUCE. A sketch of the deformation
step corresponding to this strategy is:

FOREACH process DO
    WHILE global_average_error > threshold DO
        Compute in parallel the new position of each node:
            compute the positions corresponding to the nodes of the
            "portion" of the spherical mesh associated with each process
        Compute in parallel the local error for each node
        COMMUNICATION and SYNCHRONIZATION (MPI_ALLGATHER):
            local "portion" of the sphere corresponding to each process
        GLOBAL REDUCTION (MPI_REDUCE):
            local errors of processes are transformed into a global
            average error
    END
END
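
The data flow of this loop can be imitated in a single Python process, without any MPI library, to show what the two collective operations contribute. In the sketch below each simulated process owns one contiguous block of nodes, list concatenation stands in for ALLGATHER and a sum of partial error sums stands in for REDUCE. The one-dimensional positions and the simple relaxation rule are placeholders chosen for brevity, not the force model of the paper, and all function names are hypothetical.

```python
def block_ranges(m, num_procs):
    """Split node indices 0..m-1 into num_procs equal contiguous blocks."""
    if m % num_procs != 0:
        raise ValueError("num_procs must divide the number of nodes")
    size = m // num_procs
    return [range(r * size, (r + 1) * size) for r in range(num_procs)]


def collective_iteration(positions, targets, num_procs, step=0.5):
    """Simulate one iteration of the collective-communication loop."""
    portions, error_sums = [], []
    for block in block_ranges(len(positions), num_procs):
        # Work done independently by one simulated process on its portion.
        portion = [positions[i] + step * (targets[i] - positions[i])
                   for i in block]
        portions.append(portion)
        error_sums.append(sum(abs(targets[i] - p)
                              for i, p in zip(block, portion)))
    # Stand-in for ALLGATHER: every process receives the whole updated mesh.
    gathered = [p for portion in portions for p in portion]
    # Stand-in for REDUCE: partial error sums become one global average.
    global_average_error = sum(error_sums) / len(positions)
    return gathered, global_average_error


def run_collective_loop(positions, targets, num_procs,
                        threshold=1e-3, max_iter=100):
    """Repeat iterations until the global average error reaches the threshold."""
    error, iterations = float("inf"), 0
    while error > threshold and iterations < max_iter:
        positions, error = collective_iteration(positions, targets, num_procs)
        iterations += 1
    return positions, error, iterations
```

Because every process needs the gathered mesh before the next iteration and the reduced error before the loop test, both operations also act as synchronization points.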

Optimized collective communication MPI: It differs from the previous strategy
in that only the centers of nodes in the sphere are communicated
among dependent processes. These values are the only ones needed to
execute a new iteration of the deformation step.

Shared-memory: A different parallel approach is based on the use of the
shared memory primitives of a multiprocessor architecture. This strategy
supports the creation of shared data accessible from every process. Thus,
implicit communication is performed by global memory accesses. Semaphores
are used to restrict the access to shared data and as a synchronization tool.
In this implementation two shared data items are defined: the spherical mesh
(represented as an array of nodes, where each node stores the current position
of its center, the current position of its three neighbor nodes and some other
local information), and the global average error at each iteration of the
deformation. A sketch of the deformation loop corresponding to this strategy
is:

FOREACH process DO
    WHILE global_average_error > threshold DO
        compute in parallel the new position of each node
        SYNCHRONIZATION: update the global spherical mesh
        evaluate in parallel the local error for each node
        SYNCHRONIZATION and MUTUAL EXCLUSION: compute global
            average error
    END
END
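
A minimal threaded sketch of this loop is given below. It is not the authors' implementation: the paper uses semaphores, whereas this sketch swaps in the `threading.Lock` and `threading.Barrier` primitives from the Python standard library for mutual exclusion and synchronization. The one-dimensional positions, the relaxation rule and all names are placeholders assumed for illustration.

```python
import threading


def shared_memory_deformation(positions, targets, num_workers,
                              threshold=1e-3, step=0.5, max_iter=100):
    """Relax the shared list `positions` towards `targets` in place."""
    m = len(positions)
    if m % num_workers != 0:
        raise ValueError("num_workers must divide the number of nodes")
    size = m // num_workers
    shared = {"error_sum": 0.0, "average": float("inf"), "iterations": 0}
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker(rank):
        block = range(rank * size, (rank + 1) * size)
        while True:
            # Compute the new positions of this worker's block.
            new_values = [positions[i] + step * (targets[i] - positions[i])
                          for i in block]
            barrier.wait()  # SYNCHRONIZATION: all reads of the old mesh done
            for i, value in zip(block, new_values):
                positions[i] = value  # update the shared mesh
            local_error = sum(abs(targets[i] - positions[i]) for i in block)
            with lock:  # MUTUAL EXCLUSION on the shared accumulator
                shared["error_sum"] += local_error
            # SYNCHRONIZATION: one elected thread publishes the global average.
            if barrier.wait() == 0:
                shared["average"] = shared["error_sum"] / m
                shared["error_sum"] = 0.0
                shared["iterations"] += 1
            barrier.wait()  # everyone sees the same published average
            if (shared["average"] <= threshold
                    or shared["iterations"] >= max_iter):
                return

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["average"], shared["iterations"]
```

All workers read the same published average after the last barrier, so they leave the loop in the same iteration and no thread is left waiting.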

4  Experimental  Results 

All the SAI strategies (sequential, shared memory, collective MPI and optimized
collective MPI) considered in this paper have been implemented on several
workstations (SGI5 175 MHz IP21 processor) and on a multiprocessor with ten
processors (SGI5 200 MHz IP19 processor). Other parallel MPI-based algorithms
have also been implemented (i.e. point-to-point blocking and point-to-point
non-blocking communication schemes), but the execution times obtained were
considerably worse and they have been discarded. The following subsections show
the visual results corresponding to the surface reconstruction of two objects
used in our experiments, several figures corresponding to each of the studied
SAI algorithms, and two tables which globally compare the performance of all
strategies.

4.1  Input  data  and  surface  approximation 

Our surface representation is a discrete connected mesh that is homeomorphic to
a sphere. The connectivity of the mesh is such that each node has exactly three
neighbors, and the total number of mesh nodes depends on its resolution degree.
Given a set of 3D data points from range images, captured by a range laser
and a 3D manual digitizer, the mesh representation is constructed as explained
in Subsection 2.2. The mesh is always a closed surface. Figs. 1 and 2 show
two examples of object surface reconstruction from 3D data. In our experiments,
several objects with different shapes and resolutions have been considered. For
the sake of simplicity, figures and tables refer only to three objects: object
obj1 (315 points), object obj2 (2928 points) and object obj3 (28656 points). The
figures for experiments that use only one object refer to obj3,
which is the densest one. We should also point out that the deformable mesh
is more suitable for representing rather smooth and convex objects.

4.2  Sequential  algorithm 

Figs.  3  and  4  present  experimental  results  corresponding  to  the  sequential 
implementation  for  the  SAI  model.  Fig.  3  shows  the  percentage  of  execution 
time  spent  on  each  phase  of  the  algorithm  as  a  function  of  the  mesh  resolution 
degree.  Fig.  4  presents  the  percentage  of  execution  time  for  different  object 
resolutions  as  a  function  of  each  algorithm  stage. 

Fig. 1. (a) Original scattered point set of a bottle; (b) Deformed mesh adjusted to the
original point set of the bottle.


Fig. 2. (a) Original scattered point set of a synthetic object; (b) Deformed mesh
adjusted to the original point set of this geometrical object.

[Plot. x-axis: resolution degree.]

Fig. 3. Execution time of the sequential algorithm for obj3 by varying the resolution
degree.

[Plot. x-axis: algorithm stage (generation, calculation, deformation, computation).]

Fig. 4. Percentage of execution time corresponding to each algorithm stage for different
objects in the sequential version.

4.3  Shared  Memory

Figs. 5 and 6 illustrate the behavior of the shared memory version of SAI. The
parameter NP must divide the size of the initial sphere approximation (20
triangular facets), and we have plotted the results for the values NP = 1, 2, 5,
10 and 20. In Fig. 5, note that the total execution time (expressed on a
logarithmic scale) corresponding to NP=5 is worse at a low resolution degree but
improves as RD increases. The best results for high resolution are obtained
using more processes. Fig. 6 shows, for a resolution RD=3, how the percentage of
execution time decreases as the number of processes increases. It is interesting
to note that the CLP stage consumes more than 85% of the execution time when
NP=1 or NP=2.

[Plot. Legend: NP=1, NP=2, NP=5, NP=10, NP=20.]

Fig. 5. Execution time of obj3 by varying the resolution degree for different numbers
of working processes using the shared memory algorithm.

[Plot. x-axis: algorithm stage.]

Fig. 6. Distribution of execution time among the algorithm stages for different numbers
of processes using the shared memory version (RD=3).

4.4  Collective  Communication  MPI  Algorithm

Figs. 7 and 8 are respectively equivalent to Figs. 5 and 6, but using message
passing (MPI) concurrency. We can see in Fig. 7 that the total execution times
for different numbers of processes are worse than the corresponding values of the
shared memory algorithm (compare with the values of Fig. 5). Similar behavior is
observed in Fig. 8 (where we have considered RD=4) in relation to Fig. 6.
A remarkable fact in Fig. 8 is the high percentage taken by the mesh deformation
phase (close to 50%) for NP=10, which is mainly due to the increase in
communication among processes.


[Plot. x-axis: resolution degree (RD).]

Fig. 7. Execution time of obj3 by varying the resolution degree for different numbers
of working processes using the collective MPI version.

[Plot. x-axis: algorithm stage. Legend: NP=1, NP=2, NP=4, NP=5, NP=10.]

Fig. 8. Distribution of execution time among the algorithm stages for different numbers
of processes using the collective MPI version (obj3, with RD=4).

4.5  Optimized  Collective  Communication  MPI  Algorithm

Figs. 9, 10 and 11 show the effect of the optimization of the previous collective
MPI strategy. As pointed out in Section 3, this strategy significantly reduces
the amount of data communicated among processes (only the information of a
point is passed to neighbor nodes, instead of the whole information
corresponding to a node structure). With this approach, as shown in Fig. 11,
the execution time of each algorithm phase is reduced approximately by a factor
of three with respect to the other MPI strategy considered (except for the mesh
generation stage).


Fig. 9. Execution time of obj3 by varying the resolution degree for different numbers
of working processes using the optimized MPI version.

[Plot. x-axis: algorithm stage. Legend includes NP=10.]

Fig. 10. Distribution of execution time among the algorithm stages for different numbers
of processes using the optimized MPI version (obj3, with RD=4).

Fig.  11.  Comparison  between  collective  and  optimized  MPI  strategies  using  obj3 
(28656  points),  NP=a  and  RD=b. 


4.6  Global  comparison 

The graph in Fig. 12 compares the total execution time (t_TOTAL) as a function
of the resolution degree for the different implemented SAI algorithms. The use of
a logarithmic scale to represent t_TOTAL displays the performance trends more
clearly. This figure shows that, for the considered object (obj3, 28656
points), when the resolution degree increases significantly, better performance
is achieved by the optimized collective communication MPI strategy (see this
tendency at the higher resolution degrees).

Table 1 relates the execution time to n (the sizes of the considered objects
represent different orders of magnitude, each being approximately 10 times the
size of the previous one), the four main SAI stages and the four considered
algorithms. From the overall time results, we note that, regardless of the
object size, the optimized MPI algorithm gives the best results for the stages
of mesh generation and closest point calculation. On the other hand, the shared
memory strategy behaves better for the stages of mesh deformation and
calculation of the SAI angle. It is interesting to note that the best execution
times for the closest point calculation stage using the optimized MPI algorithm
are due to the lack of interaction among processes when calculating 3D distances
(each process has its own copy of the 3D data), while in the shared memory
algorithm there is an internal synchronization of processes when accessing the
single copy of the 3D data. The best execution times for the mesh deformation
stage using the shared memory algorithm are due to the lack of data exchange
(except for the local error of each node), while in the optimized MPI algorithm
a global exchange among working processes of their own portions of the spherical
mesh is required (MPI_ALLGATHER primitive, see Section 3).

To study in depth the behavior of the two most time-efficient algorithms
(shared memory and optimized collective communication MPI), they have been
compared using an object with 83,171 points. Table 2 shows the best execution
results as a function of the resolution degree for both algorithms. Each
execution time is accompanied by the number of software processes involved in
it. Experimental results show similar behavior for resolution degrees of 0, 1
and 2; better performance of the optimized communication MPI strategy for
resolutions of 3 and 4; and a slight advantage of shared memory for a
resolution of 5. These global execution times are mainly explained by the
influence of the CLP calculation phase (the most time-consuming one in the
algorithm), which behaves better in the shared memory algorithm as the
resolution degree (RD) grows.


[Plot. x-axis: resolution degree (RD). Legend: Sequential, Shared Memory, MPI Collective, MPI Optimized.]

Fig. 12. Execution time versus resolution degree for the different SAI algorithms
applied to the object obj3.


Table 1. Global comparison of execution times (in seconds) for the SAI algorithm
stages with the considered objects (obj1=315 points, obj2=2928 points, and obj3=28656
points), RD=3 and NP=5.


Sequential | Shared Memory | MPI Collective | MPI Optimized

obj1

CRE 

4.038 

3.380 

0.746 

CLP 

0.944 

0.001 

DEF

5.569 

0.978 

46.582 

24.031 

0.016 

0.653 

0.174 

4.046 

2.919 

0.297 

CLP 

8.406 

1.704 

1.360 

0.052 

DEF

0.971 

5.574 

2.486 

SAIA

0.120 

0.075 

obj3 

Milal 

4.669 

0.908 

16.307 

12.943 

5.119 

liiaa 

0.983 

5.791 

0.514 

0.017 

0.111 

0.140 

Table 2. Comparison of shared memory and optimized collective MPI strategies for
an object with 83,171 points.

Resolution    Shared Memory             MPI Optimized
degree        t_TOTAL (sec.)    NP      t_TOTAL (sec.)    NP
0                  0.412         4           0.527         2
1                  0.556         5           0.865         ri“
2                  1.033        16           1.493         5
3                  7.113        10           4.239         5
4                 20.074        16          16.143         8
5                 99.027        16         102.328         8


5  Conclusions  and  Future  Work 

Results of several parallel implementations of the SAI model computation have
been presented in this paper. The parallel strategies, one shared memory and
two MPI-based algorithms, have been compared with the sequential version.
Due to the different nature of each stage of the SAI algorithm, the shared memory
implementation gives better execution results than the MPI implementations when
less communication is involved in the computation, i.e. in stages DEF and SAIA.
On the other hand, when the locality of the computations allows the workload
to be distributed among all the available processes, results favour the
MPI implementations, i.e. in stages CRE and CLP.

It should be mentioned that developing the MPI implementations
has been easier than using the shared memory primitives of a high level language,
because the MPI distribution (we have employed CHIMP version 2.0)
provides more powerful collective parallel programming primitives than the
standard libraries available for shared memory concurrency.

A future improvement will focus on stage CLP, whose execution time could
perhaps be reduced using 3D computational geometry techniques. Another
development will be the parallelization of the sequential version using the SGI
parallel compiler, in order to compare our results with those provided by a
commercial compiler.

Currently, the surface reconstruction algorithm reduces the approximation
error when convex objects are considered. We think this error can also be
reduced for non-convex objects if the points of the considered object closest
to each node of the spherical mesh are recomputed in each iteration of the
deformation.

Abstract. In this paper, we present parallel implementations of connected
operators. This kind of operator has recently been defined in
mathematical morphology and has attracted a large amount of research
due to its efficient use in image processing applications where contour
information is essential (image segmentation, pattern recognition, etc.).
In this work, we focus on connected transformations based on the geodesic
reconstruction process and we present parallel algorithms based on irregular
data structures. We show that the parallelization poses several
problems which are solved by using appropriate communication schemes
as well as advanced data structures.


1  Introduction 

In the area of image processing, mathematical morphology [1] has proved
its efficiency by providing a geometrical approach to image interpretation. Thus,
contrary to usual approaches, image objects are not described by their frequency
spectrum but by more visual attributes such as size, shape, contrast, etc.

Recently, complex morphological operators, known as connected operators [2],
have been defined and have become increasingly popular in image processing
because they have the fundamental property of simplifying an image without
corrupting contour information. Thanks to this property, this class of operators
can be used in all applications where contour information is essential (image
segmentation, pattern recognition, etc.).

In  this  paper,  we  focus  on  the  parallelization  of  connected  operators  based 
on  a  geodesic  reconstruction  process  [3]  aimed  at  reconstructing  the  contours 
of  a  reference  image  from  a  simplified  one.  In  spite  of  their  efficiency  in  the 
sequential  case,  these  transformations  are  difficult  to  parallelize  efficiently,  due 
to  complex  propagation  and  re-computation  phenomena,  and  thus,  advanced 
communication  schemes  and  irregular  data  structures  have  to  be  used  in  order 
to  solve  these  problems. 

The proposed parallel algorithms are designed for MIMD (Multiple Instruction
Multiple Data) architectures with distributed memory and use a message
passing programming model.

This paper is organized as follows: Section 2 introduces the theoretical
foundations of morphological connected operators and presents the connected
transformations based on the geodesic reconstruction process. Section 3 surveys
the most efficient sequential implementations, which are used as the starting
point of the parallelization. In Section 4, we detail all parallel algorithms
for morphological reconstruction, and the experimental results obtained on an
IBM SP2 machine are presented in Section 5. Finally, we conclude in Section 6.

2  Overview  of  Morphological  Connected  Operators 

This  section  presents  the  concept  of  morphological  connected  operators  and  we 
invite  the  reader  to  refer  to  [2,4,5]  for  a  more  theoretical  study. 

In the framework of mathematical morphology [1], the basic working structure
is the complete lattice. Let us recall that a complete lattice is composed
of a set equipped with a total or partial order such that each family of elements
$\{x_i\}$ possesses a supremum $\vee\{x_i\}$ and an infimum $\wedge\{x_i\}$. In the
area of grey-level image processing, the lattice of functions (where the order,
$\vee$ and $\wedge$ are respectively defined by $\leq$, Max and Min) is used.

Following this structure, the definition of grey-level connected operators has
been given in [2,4] by using the notion of partition of flat zones. Let us recall
that a partition of a space $E$ is a set of connected components $\{A_i\}$ which
are disjoint ($A_i \cap A_j = \emptyset,\ \forall i \neq j$) and the union of which
is the entire space ($\bigcup_i A_i = E$). Each connected component $A_i$ is then
called a partition class. Moreover, a partition $\{A_i\}$ is said to be finer than
a partition $\{B_i\}$ if any pair of points belonging to the same class $A_i$ also
belongs to a unique partition class $B_j$. Finally, the flat zones of a grey-level
function $f$ are defined by the set of connected components of the space where
$f$ is constant. In [2], the authors have shown that the set of flat zones of a
function defines a partition of the space, called the partition of flat zones
of the function.

From these notions, a grey-level connected operator can be formally defined
as follows:

Definition 1. An operator $\Psi$ acting on grey-level functions is said to be
connected if, for any function $f$, the partition of flat zones of $\Psi(f)$ is
less fine than the partition of flat zones of $f$.

This last definition shows that connected operators have the fundamental
property of simplifying an image while preserving contour information. Indeed,
since the associated partition of $\Psi(f)$ is less fine than the associated
partition of $f$, each flat zone of $f$ is either preserved or merged into a
neighboring flat zone. Thus, no new contour is created.

Following this principle, we can decompose a grey-level connected operator
into two steps: the first one, called the selection step, assesses a given
criterion (size, contrast, complexity, ...) for each flat zone; from this
criterion, the second one, called the decision step, decides whether to
preserve or merge the flat zone. A large number of grey-level connected
operators can thus be defined, differing only in the criterion used by the
selection step. For a presentation of different connected operators, refer
to [6].

In this paper, we focus on connected operators based on the geodesic
reconstruction process [3]. This kind of transformation is actively used in
applications such as grey-level and color image and video segmentation [7-9],
defect detection [10], texture classification [11] and so forth. The
reconstruction process is based on the elementary geodesic dilation
$\delta^{(1)}(f, g)$ of the grey-level image $g \leq f$ "under" $f$, defined by
$\delta^{(1)}(f, g) = \mathrm{Min}\{\delta_1(g), f\}$, where $\delta_1(g)$ is
the elementary numerical dilation of $g$ given by
$\delta_1(g)(p) = \mathrm{Max}\{g(q), q \in N(p) \cup \{p\}\}$, with $N(p)$
denoting the neighborhood of the pixel $p$ in the image $g$.

The geodesic reconstruction of a marker image $g$ with reference to a mask
image $f$ with $g \leq f$ is obtained by iterating the elementary grey-level
geodesic dilation of $g$ "under" $f$ until stability:

$$\rho(g \mid f) = \delta^{(1)}(f, \delta^{(1)}(f, \ldots \delta^{(1)}(f, g) \ldots)) \qquad (1)$$

Figure 1 shows a 1D example of the reconstruction process $\rho(g \mid f)$, in
which all contours of the function $f$ are perfectly preserved after the
reconstruction.

Fig. 1. 1D example of grayscale reconstruction of mask f from marker g
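
As an illustration of equation (1), the following minimal Python sketch iterates the elementary geodesic dilation until stability. It is a naive reference version rather than one of the efficient algorithms discussed later; the function names (`neighbors4`, `geodesic_dilation`, `reconstruct_naive`), the list-of-lists image layout, the choice of 4-connectivity and the sample values are assumptions made for this example.

```python
def neighbors4(r, c, rows, cols):
    """Yield the 4-connected neighbors of (r, c) inside a rows x cols grid."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc


def geodesic_dilation(f, g):
    """One elementary geodesic dilation of marker g under mask f."""
    rows, cols = len(f), len(f[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Elementary dilation: maximum over the pixel and its neighbors.
            m = g[r][c]
            for rr, cc in neighbors4(r, c, rows, cols):
                m = max(m, g[rr][cc])
            # Geodesic constraint: never exceed the mask.
            out[r][c] = min(m, f[r][c])
    return out


def reconstruct_naive(f, g):
    """Iterate the geodesic dilation until nothing changes (equation (1))."""
    while True:
        nxt = geodesic_dilation(f, g)
        if nxt == g:
            return g
        g = nxt


# A 1D signal stored as a single-row image, with g <= f everywhere.
f = [[1, 5, 5, 2, 7, 7, 3, 4, 1]]
g = [[0, 4, 0, 0, 0, 0, 0, 0, 0]]
result = reconstruct_naive(f, g)
print(result)  # [[1, 4, 4, 2, 2, 2, 2, 2, 1]]
```

In this example the marker value 4 spreads under the mask and is clipped wherever the mask is lower, so the peak of height 7 that the marker cannot reach at full height is flattened to 2.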

Depending on the technique used to obtain the marker image $g$, a large number
of connected operators can be defined. Two connected operators based on this
reconstruction process are intensively used in the literature. The first one,
known as opening by reconstruction, has a size-oriented simplification effect
since the marker image is obtained by a morphological opening removing all
bright objects smaller than the size of the structuring element. The second
one, known as the $\lambda$-Max operator, has a contrast-oriented
simplification effect since the marker image is obtained by subtracting a
constant $\lambda$ from the mask image ($g = f - \lambda$). Figure 2 shows
these two transformations. From the Cameraman test image (see Figure 2(a)),
Figure 2(b) shows the result of an opening by reconstruction
$\rho(\gamma_{15}(f) \mid f)$, where $\gamma_n(f)$ denotes the morphological
opening of the function $f$ using a structuring element of size $n$.
Figure 2(c) shows a $\lambda$-Max operator with $\lambda = 40$, denoted by
$\rho(f - 40 \mid f)$.

Fig. 2. Example of geodesic reconstruction on grey-level images

Note that for each connected operator defined above, a dual transformation can
be obtained by exchanging the Max and Min operators. We thus obtain a closing
by reconstruction, which removes all dark objects smaller than the structuring
element, and a $\lambda$-Min transformation, which removes all objects which
have a contrast higher than $\lambda$. Owing to this duality relation, we only
study in this paper the reconstruction process by geodesic dilation (see
equation (1)).

3  Efficient  Sequential  Implementations  of  Geodesic 
Reconstruction 

The most efficient algorithms for geodesic reconstruction have been proposed in
[3]. In this paper, we only focus on implementations based on irregular data
structures, which consider the image under study as a graph and perform a
breadth-first scanning of the graph from strategically located pixels [12].
These algorithms proceed in two main steps:

-  detection  of  the  pixels  which  can  initiate  the  reconstruction  process, 

-  propagation  of  the  information  only  in  the  relevant  image  parts. 

In the case of reconstruction by geodesic dilation (see equation (1)), we can
easily show that a pixel $p$ can initiate the reconstruction of a mask image
$f$ from a marker image $g$ if it has in its neighborhood at least one pixel
$q$ such that $g(q) < g(p)$ and $g(q) < f(q)$.

Proposition 2. In the reconstruction of a mask image $f$ from a marker image
$g$ with $g \leq f$, the only pixels $p$ which can propagate their grey-level
value in their neighborhood verify: $\exists q \in N(p)$, $g(q) < g(p)$ and
$g(q) < f(q)$.

Proof. Let $p$ be a pixel such that $\exists q \in N(p)$, $g(q) < g(p)$ and
$g(q) < f(q)$.

We have $\delta^{(1)}(f, g)(q) = \mathrm{Min}\{\delta_1(g)(q), f(q)\}$ with
$\delta_1(g)(q) = \mathrm{Max}\{g(t), t \in N(q) \cup \{q\}\}$.

Since $q \in N(p)$, we have $p \in N(q)$. Moreover, we know that
$g(q) < g(p)$ and thus $q$ is not a fixed point of the transformation
$\delta_1$, i.e. $\delta_1(g)(q) > g(q)$. On the other hand, since
$g(q) < f(q)$, the transformation $\delta^{(1)}(f, g)$ has not reached
stability at the location of $q$. As a result, the pixel $q$ can receive a
propagation from the pixel $p$.

To conclude the proof, it is straightforward that if $\forall q \in N(p)$,
$g(q) \geq g(p)$, the pixel $p$ cannot propagate its grey-level value but can
only receive a propagation from its neighborhood. $\square$

Note that the same pixel can both propagate its grey-level value to one
neighbor pixel and receive a propagation from another neighbor pixel.
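
The condition of Proposition 2 translates directly into a predicate on a pixel and its neighbors. The sketch below assumes 4-connectivity and images stored as lists of rows; the name `can_propagate` is hypothetical.

```python
def can_propagate(f, g, r, c):
    """True if pixel (r, c) has a neighbor q with g(q) < g(p) and g(q) < f(q)."""
    rows, cols = len(g), len(g[0])
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            # q is darker than p and has not yet reached its mask value.
            if g[rr][cc] < g[r][c] and g[rr][cc] < f[rr][cc]:
                return True
    return False


f = [[5, 5, 5]]
g = [[1, 3, 5]]
print([can_propagate(f, g, 0, c) for c in range(3)])  # [False, True, True]
```

The leftmost pixel cannot propagate because no neighbor is darker than it, although it can itself receive a propagation, which matches the closing remark of the proof.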

Following this principle, two methods have been proposed in [3] to detect
initiator pixels. The first one consists in computing the regional maxima of
the marker image $g$, that is, the connected components $M$ with a grey-level
value $h$ in $g$ such that every pixel in the neighborhood of $M$ has a
grey-level value strictly lower than $h$. In this case, initiator pixels are
those located on the interior boundaries of the regional maxima of $g$.

The second method is based on two scans of the marker image $g$. The first
scan is done in raster order (from top to bottom and left to right) and each
grey-level value $g(p)$ becomes
$g(p) = \mathrm{Min}\{\mathrm{Max}\{g(q), q \in N^+(p) \cup \{p\}\}, f(p)\}$,
where $N^+(p)$ is the set of neighbors of $p$ which are reached before $p$ in
a raster scan (see Figure 3(a)). The second scan is done in anti-raster order
and each grey-level value $g(p)$ becomes
$g(p) = \mathrm{Min}\{\mathrm{Max}\{g(q), q \in N^-(p) \cup \{p\}\}, f(p)\}$,
where $N^-(p)$ designates the neighbors of $p$ reached after $p$ in the raster
scanning order (see Figure 3(b)). In this case, the initiator pixels are
detected during the second scan and are those which could propagate their
grey-level value during the next raster scan. These pixels $p$ verify:
$\exists q \in N^-(p)$, $g(q) < g(p)$ and $g(q) < f(q)$.


Fig. 3. Definition of $N^+(p)$ (a) and $N^-(p)$ (b) in the case of 4-connectivity

One can easily see that these two methods comply with the general principle
stated in Proposition 2.

Once the initiator pixels are detected, the information has to be propagated
by a breadth-first scanning. For this purpose, the breadth-first scanning is
implemented in [3] by a queue of pixels represented by a FIFO (First In First
Out) data structure. The initiator pixels are first inserted in the queue. The
propagation then consists in extracting the first pixel $p$ from the queue and
propagating its grey-level value $g(p)$ to all of its neighbors $q$ such that
$g(q) < g(p)$ and $g(q) < f(q)$. The grey level of these pixels then becomes
$g(q) = \mathrm{Min}\{g(p), f(q)\}$, and the last operation consists in
inserting these pixels $q$ in the queue in order to continue the propagation.
The reconstruction stops when the queue is empty.
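
The two-scan detection of initiator pixels followed by the FIFO propagation can be sketched in Python as follows. This is a minimal illustration assuming 4-connectivity and list-of-lists images; the function name `reconstruct_hybrid` and the test image are assumptions, and the code is not meant to reproduce the exact implementation of [3].

```python
from collections import deque


def reconstruct_hybrid(f, g):
    """Two scans (raster, anti-raster) followed by FIFO propagation."""
    rows, cols = len(f), len(f[0])
    g = [row[:] for row in g]
    before = ((-1, 0), (0, -1))  # N+(p): neighbors reached before p
    after = ((1, 0), (0, 1))     # N-(p): neighbors reached after p

    def nb(r, c, offsets):
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield rr, cc

    # First scan, in raster order.
    for r in range(rows):
        for c in range(cols):
            m = max([g[r][c]] + [g[rr][cc] for rr, cc in nb(r, c, before)])
            g[r][c] = min(m, f[r][c])

    # Second scan, in anti-raster order, with detection of initiator pixels.
    queue = deque()
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            m = max([g[r][c]] + [g[rr][cc] for rr, cc in nb(r, c, after)])
            g[r][c] = min(m, f[r][c])
            if any(g[rr][cc] < g[r][c] and g[rr][cc] < f[rr][cc]
                   for rr, cc in nb(r, c, after)):
                queue.append((r, c))

    # Breadth-first propagation from the initiator pixels.
    while queue:
        r, c = queue.popleft()
        for rr, cc in nb(r, c, before + after):
            if g[rr][cc] < g[r][c] and g[rr][cc] < f[rr][cc]:
                g[rr][cc] = min(g[r][c], f[rr][cc])
                queue.append((rr, cc))
    return g


# Spiral-shaped bright path in the mask; the marker is a single pixel on it.
f = [[9, 9, 9, 9, 9],
     [0, 0, 0, 0, 9],
     [9, 9, 9, 0, 9],
     [9, 0, 0, 0, 9],
     [9, 9, 9, 9, 9]]
g = [[0] * 5 for _ in range(5)]
g[2][2] = 5
result = reconstruct_hybrid(f, g)
```

On this spiral mask the two scans only fill the row containing the marker pixel; the rest of the path is reached through the queue, which is the situation the propagation step is designed for.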


4  Parallel  Implementations  of  Geodesic  Reconstruction 

4.1  Preliminaries 


In this paper, we are interested in the parallelization of morphological
reconstruction based on geodesic dilation since, by duality, reconstruction by
geodesic erosion can be obtained by exchanging the Min and Max operators in
equation (1).

All proposed parallel algorithms are designed for MIMD (Multiple Instruction
Multiple Data) architectures with distributed memory and use a message passing
programming model. In this parallel context, we denote by $p$ the number of
processors and by $P_i$ the processor indexed by $i$ ($0 \leq i < p$).

The mask image $f$ and the marker image $g$, whose domain of definition is
denoted by $D$, are split into $p$ sub-images $f_i$ and $g_i$ defined on
disjoint domains $D_i$ ($0 \leq i < p$). For the sake of simplicity, we assume
that the partitioning is made in a horizontal 1D-rectilinear fashion. Thus, if
we suppose that $f$ and $g$ are of size $n \times n$, each processor $P_i$
owns a mask sub-image $f_i$ and a marker sub-image $g_i$, each of size
$\frac{n}{p} \times n$. Note that all proposed algorithms can be extended to a
2D-rectilinear partitioning scheme without difficulty.

Moreover, in order to study the propagation of its local pixels onto non-local
pixels located in neighbor sub-images, each processor $P_i$ owns two 1-pixel
wide overlapping zones on its neighbor sub-images $f_{i-1}$ and $g_{i-1}$ (for
$i > 0$), and $f_{i+1}$ and $g_{i+1}$ (for $i < p - 1$). From this extended
sub-domain, denoted by $ED_i$, we denote by $LB(i,j)$ the local pixels located
on the frontier between $P_i$ and $P_j$, that is $LB(i,j) = D_i \cap ED_j$,
and by $NB(i,j)$ the non-local pixels located on the frontier between $P_i$
and $P_j$, that is $NB(i,j) = ED_i \cap D_j$. Figure 4 shows all of these
notations.


Fig. 4. 1D rectilinear data partitioning

On the other hand, we denote by $N_D(p)$ the neighborhood of the pixel $p$ in
the domain $D$ and by $N_{D_i}(p)$ the neighborhood of the pixel $p$ restricted
to the sub-domain $D_i$.
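
The notations $D_i$, $ED_i$, $LB(i,j)$ and $NB(i,j)$ can be made concrete with a small Python sketch that works on row indices only. It assumes that $p$ divides $n$ and that each sub-domain is described by the rows it contains; the function names are hypothetical.

```python
def partition_rows(n, p):
    """Return, for each processor, the rows of D_i and of ED_i (p divides n)."""
    h = n // p
    parts = []
    for i in range(p):
        d = range(i * h, (i + 1) * h)
        # The extended sub-domain adds one row on each existing neighbor side.
        lo = d.start - 1 if i > 0 else d.start
        hi = d.stop + 1 if i < p - 1 else d.stop
        parts.append((d, range(lo, hi)))
    return parts


def local_boundary(parts, i, j):
    """LB(i, j) = D_i intersected with ED_j, as a sorted list of rows."""
    return sorted(set(parts[i][0]) & set(parts[j][1]))


def non_local_boundary(parts, i, j):
    """NB(i, j) = ED_i intersected with D_j, as a sorted list of rows."""
    return sorted(set(parts[i][1]) & set(parts[j][0]))


parts = partition_rows(8, 4)
print(local_boundary(parts, 0, 1))      # [1]
print(non_local_boundary(parts, 0, 1))  # [2]
```

By construction, the rows that are local boundary pixels for $P_i$ are exactly the non-local boundary pixels seen by $P_j$, that is `local_boundary(parts, i, j) == non_local_boundary(parts, j, i)`.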

Before presenting our parallel geodesic reconstruction algorithms, we would
like to underline the difficulty of this parallelization and the motivation of
this work. As mentioned in section 2, geodesic reconstruction is not a locally
defined transformation, which results in a complex propagation phenomenon that
makes it difficult to parallelize efficiently. If we consider the example of
reconstruction by geodesic dilation (see equation (1)), a bright pixel can
propagate its grey-level value to a large part of the image, and no technique
allows us to predict this phenomenon a priori. Now, in parallel
implementations where the initial images are split and distributed among
processors, each processor performs the reconstruction from its local data.
Thus, when a propagation crosses an inter-processor frontier, it can trigger
the re-computation of a large number of pixels located in the sub-image that
has received the propagation. Figure 5 illustrates this phenomenon with a 1D
distribution in which the mask image $f$ and the marker image $g$ are
distributed on two processors $P_0$ and $P_1$. In this example, the processor
$P_1$ owns a dark marker sub-image whereas the processor $P_0$ owns a bright
marker sub-image (see Figure 5(a)). From a computational point of view, if
$P_1$ entirely reconstructs its local image before taking into account the
propagation messages sent by $P_0$, then $P_1$ will have to re-compute a large
number of pixels and the total execution time will increase (see
Figures 5(b,c)). The motivation of this work is thus to adopt communication
schemes adapted to this inter-processor propagation phenomenon and to propose
efficient data structures limiting the re-computation phenomenon.


Fig. 5. Propagation and re-computation phenomena

In this paper, we only focus on parallel geodesic reconstruction algorithms
based on irregular data structures represented by pixel queues because of
their efficiency in the sequential case. As mentioned in section 3, sequential
algorithms proceed in two steps: detection of initiator pixels and propagation
of the information. Parallel algorithms which follow this principle are very
irregular since, on the one hand, the number of detected initiator pixels can
differ on each processor and, on the other hand, the propagation of
information can evolve differently on each processor. Moreover, because of
this irregular nature, the communications needed to propagate information
between processors are also irregular, since the number of sent and received
propagations can differ on each processor.

4.2  Parallel  Reconstruction  Based  on  Regional  Maxima 

The parallel algorithm based on the regional maxima of the marker image
proceeds in two steps. In the first one, all processors compute in parallel
the regional maxima of the marker image $g$ (for this step, refer to [13]) and
produce a temporary image $R$ defined by:

$$R(p) = \begin{cases} g(p) & \text{if } p \text{ belongs to a regional maximum} \\ 0 & \text{otherwise.} \end{cases}$$

From the image $R$, each processor $P_i$ initializes its local pixel queue
$F_i$ by inserting all pixels $p \in D_i$ located on the internal boundaries
of regional maxima:

$$F_i = \{p \in D_i,\ R(p) \neq 0 \text{ and } \exists q \in N_D(p) \text{ such that } R(q) = 0\}.$$
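
The construction of $R$ and of the initial queue $F_i$ can be sketched as follows. The regional maxima are computed here by a plain sequential flood fill of the flat zones, which is only a stand-in for the parallel method of [13]; the function names, the `owned_rows` parameter describing $D_i$, and the assumption that regional maxima have non-zero grey levels (so that $R(p) \neq 0$ identifies them) are choices made for this example.

```python
from collections import deque


def neighbors4(r, c, rows, cols):
    """Yield the 4-connected neighbors of (r, c) inside the image."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc


def regional_maxima_image(g):
    """Return R with R(p) = g(p) on regional maxima and 0 elsewhere."""
    rows, cols = len(g), len(g[0])
    R = [[0] * cols for _ in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            # Flood-fill the flat zone containing (r0, c0).
            h, zone, stack, is_max = g[r0][c0], [], [(r0, c0)], True
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                zone.append((r, c))
                for rr, cc in neighbors4(r, c, rows, cols):
                    if g[rr][cc] > h:
                        is_max = False  # a brighter neighbor exists
                    elif g[rr][cc] == h and not seen[rr][cc]:
                        seen[rr][cc] = True
                        stack.append((rr, cc))
            if is_max:
                for r, c in zone:
                    R[r][c] = h
    return R


def initial_queue(R, owned_rows):
    """F_i: owned pixels of a regional maximum with a neighbor outside it."""
    rows, cols = len(R), len(R[0])
    return deque(
        (r, c)
        for r in owned_rows
        for c in range(cols)
        if R[r][c] != 0
        and any(R[rr][cc] == 0 for rr, cc in neighbors4(r, c, rows, cols))
    )


# A 3 x 3 bright plateau (value 7) on a background of 1.
g = [[7 if 1 <= r <= 3 and 1 <= c <= 3 else 1 for c in range(5)]
     for r in range(5)]
R = regional_maxima_image(g)
queue_all = initial_queue(R, range(5))
```

Only the eight pixels on the ring of the plateau are queued; the center pixel has no neighbor outside the maximum and therefore cannot initiate a propagation.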

During the second step, the information is propagated through the image and
each processor can then receive a propagation from its neighbor processors.
These interactions are implemented by communication primitives, and two
methods can be proposed for this step: the synchronous approach, in which
communications are synchronization points which occur regularly, and the
asynchronous approach, in which messages are exchanged immediately after a
propagation is detected.

Synchronous Approach. First, each processor $P_i$ applies the reconstruction
process starting from the initiator pixels inserted in $F_i$ after the
computation of regional maxima. Once $F_i$ has been consumed, each processor
$P_i$ exchanges with its neighbors $P_j$ ($j = i - 1, i + 1$) all pixels of
$g_i$ located in $LB(i,j)$. From the received data and the local ones, each
processor $P_i$ can then detect its local pixels that have received a
propagation. These pixels $p$ are those located in $LB(i,j)$
($j = i - 1, i + 1$) which have a neighbor $q$ located in the neighbor
sub-image $g_j$ such that $g_j(q) > g_i(p)$ and $g_i(p) < f_i(p)$. The value
of these pixels then becomes $g_i(p) = \mathrm{Min}(g_j(q), f_i(p))$ (see
Figure 6). These pixels form a new set of initiator pixels and are thus
inserted in turn in $F_i$. The reconstruction process can then be iterated
with this new set.

This general scheme is iterated $n_{it}$ times until global stabilization,
which is detected when no processor receives a propagation from its neighbors.

This approach follows a "divide and conquer" programming paradigm and can be
qualified as semi-irregular: the local propagation step remains irregular
because it is based on a FIFO data structure, but communications are performed
in a regular and synchronized fashion.

Fig. 6. Example of inter-processor propagation

The main disadvantage of this technique is that inter-processor propagations
are taken into account as late as possible, resulting in a significant
re-computation phenomenon. Moreover, a propagation starting on a processor
$P_i$ and going up to $P_{i+k}$ ($k \in [-i, p - i - 1]$) will be taken into
account by $P_{i+k}$ only after $|k|$ communication steps. Finally, due to the
regular nature of the communications, this approach offers no overlapping of
computation and communications.
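
The synchronous scheme can be simulated in a single Python process, with each horizontal strip playing the role of one processor. The sketch below is only a sequential simulation under several assumptions: no message passing library is used, the boundary exchange is modeled by a snapshot of the marker image, $p$ divides the number of rows, and the first set of initiator pixels is found by applying Proposition 2 to local neighbors instead of computing regional maxima. All names are hypothetical.

```python
from collections import deque

OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def local_initiators(f, g, rows_i):
    """Pixels of the strip that can propagate to a neighbor in the same strip."""
    cols, queue = len(g[0]), deque()
    for r in rows_i:
        for c in range(cols):
            for dr, dc in OFFSETS:
                rr, cc = r + dr, c + dc
                if (rr in rows_i and 0 <= cc < cols
                        and g[rr][cc] < g[r][c] and g[rr][cc] < f[rr][cc]):
                    queue.append((r, c))
                    break
    return queue


def local_propagate(f, g, rows_i, queue):
    """FIFO propagation restricted to the rows owned by one processor."""
    cols = len(g[0])
    while queue:
        r, c = queue.popleft()
        for dr, dc in OFFSETS:
            rr, cc = r + dr, c + dc
            if (rr in rows_i and 0 <= cc < cols
                    and g[rr][cc] < g[r][c] and g[rr][cc] < f[rr][cc]):
                g[rr][cc] = min(g[r][c], f[rr][cc])
                queue.append((rr, cc))


def synchronous_reconstruct(f, g, p):
    """Return the reconstructed image and the number of iterations."""
    n, cols, h = len(g), len(g[0]), len(g) // p
    g = [row[:] for row in g]
    strips = [range(i * h, (i + 1) * h) for i in range(p)]
    queues = [local_initiators(f, g, rows_i) for rows_i in strips]
    iterations = 0
    while True:
        iterations += 1
        for rows_i, queue in zip(strips, queues):
            local_propagate(f, g, rows_i, queue)
        # Boundary values as they are "sent" at the synchronization point.
        snapshot = [row[:] for row in g]
        received = False
        for i, rows_i in enumerate(strips):
            borders = ((rows_i.start, rows_i.start - 1),
                       (rows_i.stop - 1, rows_i.stop))
            for r, rn in borders:
                if not 0 <= rn < n:
                    continue
                for c in range(cols):
                    if snapshot[rn][c] > g[r][c] and g[r][c] < f[r][c]:
                        g[r][c] = min(snapshot[rn][c], f[r][c])
                        queues[i].append((r, c))
                        received = True
        if not received:
            return g, iterations


# A bright top strip propagating downwards through four one-row strips.
f = [[9, 9] for _ in range(4)]
g = [[5, 5], [0, 0], [0, 0], [0, 0]]
result, iterations = synchronous_reconstruct(f, g, 4)
print(result, iterations)  # [[5, 5], [5, 5], [5, 5], [5, 5]] 4
```

In this example the propagation crosses one frontier per communication step, so three steps are needed to reach the last strip and a fourth one to detect stabilization, which illustrates the latency discussed above.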


Asynchronous Approach. Contrary to the previous approach, inter-processor
propagation messages are here exchanged as soon as they are detected, so that
they are taken into account as early as possible. This technique solves the
problems posed by the synchronous approach since it reduces the amount of
communications crossing the network and limits the re-computation phenomenon.

In this approach, the reconstruction stops when all processors have emptied
their local pixel queue and when all sent propagation messages have been taken
into account by their recipients. In order to detect this stabilization, a
token is used which records all sent propagation messages that have not yet
been taken into account. This token is implemented by a vector $T$ of size $p$
in which $T[i]$ ($0 \leq i < p$) designates the number of messages sent to
$P_i$ and not yet received. Thus, global stabilization is reached when all
entries of $T$ are zero.

As in the synchronous approach, when the grey-level value of a boundary pixel
$p$ is modified by a processor $P_i$, a message is sent to the corresponding
neighbor processor $P_j$ indicating the position $x$ and the new grey-level
value $g_i(p)$ of the modified pixel $p$. The processor $P_j$ receiving a
message $(x, g_i(p))$ of this kind takes into account the propagation effect
of the pixel $p$ on its own marker sub-image $g_j$. For this purpose, it scans
all pixels $q$ located in $g_j$ and in the neighborhood of $p$
($q \in N_{D_j}(p)$) and it applies the propagation to all of these pixels
verifying $g_j(q) < g_i(p)$ and $g_j(q) < f_j(q)$. The grey-level value of
these pixels then becomes $g_j(q) = \mathrm{Min}\{g_i(p), f_j(q)\}$. The last
step consists in inserting the pixel $q$ in the queue $F_j$.
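
The message handling described above, together with the token vector $T$ used for termination, can be sketched as a single-process Python simulation in which processors are served in round-robin order. Several simplifications are assumptions of this example and not part of the described algorithm: each message carries the coordinates of the receiving pixel $q$ directly (with 4-connectivity and horizontal strips, $N_{D_j}(p)$ contains a single pixel), every pixel is queued once at start-up instead of detecting initiators, the initial boundary values are announced by messages rather than read from overlapping zones, and $p$ divides the number of rows. All names are hypothetical.

```python
from collections import deque

OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def asynchronous_reconstruct(f, g, p):
    """Round-robin simulation of the asynchronous propagation scheme."""
    n, cols, h = len(g), len(g[0]), len(g) // p
    g = [row[:] for row in g]
    queues = [deque() for _ in range(p)]   # local pixel queues F_i
    inbox = [deque() for _ in range(p)]    # pending propagation messages
    T = [0] * p                            # T[j]: sent to P_j, not yet received

    def owner(r):
        return r // h

    def send(r, c):
        """Announce the value of (r, c) to the owners of its non-local neighbors."""
        for rn in (r - 1, r + 1):
            if 0 <= rn < n and owner(rn) != owner(r):
                inbox[owner(rn)].append((rn, c, g[r][c]))
                T[owner(rn)] += 1

    def raise_pixel(r, c, value):
        """Apply a propagation of `value` to pixel (r, c) when it is allowed."""
        if g[r][c] < value and g[r][c] < f[r][c]:
            g[r][c] = min(value, f[r][c])
            queues[owner(r)].append((r, c))
            send(r, c)

    for r in range(n):
        for c in range(cols):
            queues[owner(r)].append((r, c))
            send(r, c)

    # Global stabilization: all queues empty and all entries of T equal to zero.
    while any(queues) or any(T):
        for i in range(p):
            # Incoming propagations are handled before local work.
            while inbox[i]:
                r, c, value = inbox[i].popleft()
                T[i] -= 1
                raise_pixel(r, c, value)
            if queues[i]:
                r, c = queues[i].popleft()
                for dr, dc in OFFSETS:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < cols and owner(rr) == i:
                        raise_pixel(rr, cc, g[r][c])
    return g


f = [[9, 9] for _ in range(4)]
g = [[5, 5], [0, 0], [0, 0], [0, 0]]
result = asynchronous_reconstruct(f, g, 4)
print(result)  # [[5, 5], [5, 5], [5, 5], [5, 5]]
```

Because each processor drains its inbox before extracting the next pixel from its queue, a propagation coming from a neighbor is applied before further local work is done, which is the property that limits re-computation in this approach.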

Finally, once a processor $P_i$ has finished its local reconstruction
($F_i = \emptyset$), it enters a blocking state until global stabilization is
reached. In this state, it can receive either a propagation message from a
neighbor processor that has not yet terminated its reconstruction or a message
for the token management.

In these synchronous and asynchronous irregular algorithms, the order in which
pixels are consumed during the propagation step directly depends on their
insertion order in the FIFO, which depends in turn on the image scanning order
in the initialization phase. As a consequence, for sequential as well as
parallel geodesic reconstruction, some pixels of the marker image can be
modified more than once by receiving propagations from several pixels. In the
case of reconstruction based on regional maxima, this problem appears when
some pixels, which have received a propagation from a regional maximum $M_1$
with a grey-level value $h_1$, can also be reached by a regional maximum $M_2$
with a grey-level value $h_2 > h_1$. Let us consider the example illustrated
in Figure 7(a), where the marker image has two regional maxima $M_1$ and $M_2$
with respective grey-level values of 10 and 20. For the sake of simplicity, we
assume that the mask image has a constant grey-level value equal to 25.
Suppose now that during the initiator pixel detection step, the pixels located
on the internal boundaries of $M_1$ are inserted in the FIFO before the pixels
located on the internal boundaries of $M_2$. In this figure, where arrows
designate the direction of propagation, each maximum is extended to its
neighborhood. At the second step of propagation (see Figure 7(b)), the maximum
$M_2$, which has a grey-level value higher than $h_1$, begins to trigger the
re-computation of some pixels previously modified by the propagation started
from $M_1$. As the iterations proceed, each maximum is spatially extended and
all pixels located in the intersection of these extensions are re-computed
(see Figure 7(c,d), where re-computations are marked by dark grey dashed
squares). At the end of the reconstruction, all pixels of the marker image
have a grey-level value of 20 and thus, all pixels modified during the
extension of $M_1$ have been re-computed by receiving a propagation from
$M_2$.


Fig. 7. Re-computation phenomenon for reconstruction based on regional maxima


It would thus be interesting to propose a data structure ensuring that each
pixel is modified only once. For this purpose, it is important to observe that
in the case of reconstruction by geodesic dilation (see equation (1)), the
propagation always goes from bright pixels to dark ones, and thus we can
assign to each pixel a reconstruction priority given by its grey-level value.
In the case of reconstruction by geodesic dilation of a mask image $f$ from a
marker image $g$, each pixel $p$ will thus be inserted in the queue with a
priority given by $g(p)$. One can easily see that this technique cancels the
re-computation phenomenon in the sequential case and limits it as far as
possible in the parallel case. Moreover, the priority mechanism can be
efficiently implemented by using a hierarchical FIFO [14] with as many
priority levels as grey-level values in the marker image.

With this new data structure, all proposed algorithms can be rewritten by only
modifying the calls to the FIFO management.
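
The effect of the priority mechanism can be checked on a 1D version of the two-maxima example, with maxima at 10 and 20 and a constant mask of 25. The Python sketch below counts pixel modifications with a classical FIFO and with a hierarchical FIFO modeled as one queue per grey level, served from the brightest level down. The function name `propagate`, the `seeds` argument and the assumption of non-negative integer grey levels are choices made for this example, not the structure of [14].

```python
from collections import deque


def propagate(f, g, seeds, hierarchical):
    """1D propagation; returns the result and the number of modifications."""
    g = list(g)
    # One bucket per grey level for the hierarchical FIFO, one bucket otherwise.
    levels = max(f) + 1 if hierarchical else 1
    buckets = [deque() for _ in range(levels)]

    def push(i):
        buckets[g[i] if hierarchical else 0].append(i)

    for i in seeds:
        push(i)

    modifications = 0
    # Serve the brightest level first; new pixels never go to a higher level.
    for bucket in reversed(buckets):
        while bucket:
            i = bucket.popleft()
            for j in (i - 1, i + 1):
                if 0 <= j < len(g) and g[j] < g[i] and g[j] < f[j]:
                    g[j] = min(g[i], f[j])
                    modifications += 1
                    push(j)
    return g, modifications


f = [25] * 7
g = [10, 0, 0, 0, 0, 0, 20]
seeds = [0, 6]  # the darker maximum is inserted first
classical = propagate(f, g, seeds, hierarchical=False)
prioritized = propagate(f, g, seeds, hierarchical=True)
print(classical)    # ([20, 20, 20, 20, 20, 20, 20], 9)
print(prioritized)  # ([20, 20, 20, 20, 20, 20, 20], 6)
```

Both runs produce the same image, but the classical FIFO modifies three pixels twice, whereas the hierarchical FIFO modifies each of the six changed pixels exactly once.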

4.3 Hybrid Parallel Reconstruction

In the previous implementations, a large part of the reconstruction time is
spent computing the regional maxima. To solve this problem, it has been
proposed in [3] to detect initiator pixels from two scans of the marker image,
as explained in section 3. All techniques presented in the previous section
for the propagation step can be applied here, since the only modified step is
the detection of initiator pixels. Thus, the parallel algorithms based on this
technique proceed in two steps:

-  detection of initiator pixels from two scans of the marker sub-image,

-  propagation of information in a synchronous or asynchronous way by using
   a classical or hierarchical FIFO.

5  Experimental  Results 

In this section, we present experimental results for our parallel algorithms,
obtained on an IBM SP2 machine with 16 processing nodes by using the MPI
(Message Passing Interface) [15] communication library. For these
measurements, four test images (Cameraman, Lena, Landsat and Peppers) have
been used, each of size 256 x 256. For the Cameraman and Lena test images, a
parallel opening by reconstruction has been measured, for which the marker
image has been obtained by a morphological opening of size 5 and 15
respectively. For the Landsat and Peppers test images, a $\lambda$-Max
operator has been measured, for which the marker image has been obtained by
subtracting the constant values 20 and 40 respectively from the mask image.

As explained in section 4.3, the most promising parallel algorithms use two
scans of the marker image in order to detect initiator pixels. We therefore
only analyse in this section the four algorithms based on this technique:

-  algorithm  1  :  synchronous  propagation  step  and  use  of  classical  FIFO, 

-  algorithm  2  :  asynchronous  propagation  step  and  use  of  classical  FIFO, 

-  algorithm  3  :  synchronous  propagation  step  and  use  of  hierarchical  FIFO, 

-  algorithm  4  :  asynchronous  propagation  step  and  use  of  hierarchical  FIFO. 

Fig. 8. Speedup coefficients of all proposed algorithms

Figure 8 shows the speedup coefficients obtained by all proposed algorithms on
the four test images. First of all, we can note that these coefficients are
limited because of, on the one hand, the irregularity of the proposed
algorithms and, on the other hand, the small size of the processed images.
However, these results clearly show the behavior of our algorithms. We can
observe that synchronous approaches (algorithms 1 and 3) are not adapted to
the marked irregularity of the algorithms. The experiments have shown that the
number of iterations $n_{it}$ needed to reach global stabilization increases
with the number of processors. Indeed, as a propagation can only cross one
inter-processor frontier at each communication step, $n_{it}$ is bounded by
the number of processors. This fact explains the chaotic behavior of
algorithms 1 and 3 for the Lena test image beyond 10 processors. Moreover,
these synchronous algorithms can also exhibit a totally sequential behavior
for some images where the grey-level values evolve continuously from top to
bottom. As a result, we can conclude that parallel geodesic reconstruction
algorithms based on the synchronous approach are not scalable.

Algorithm 2 presents a very limited speedup caused by the use of a classical
FIFO data structure. Indeed, as explained in section 4, this data structure
does not ensure that each pixel is modified only once, and a communication is
performed each time a pixel located on an inter-processor frontier is
modified. As a result, the number of communications is higher than for
synchronous approaches, since for these approaches a communication is
performed only after the termination of all local reconstructions.

Finally, algorithm 4 gives the best results and shows the most regular
behavior for all test images. For this small image size, the relative
efficiency is always above 40% and we can note that this algorithm shows the
best scalability.


Fig. 9. Speedup of algorithms with a variable problem size

In order to further study the scalability of our algorithms, we have tested
them on a variable problem size. For this purpose, we have used test images of
size 512 x 512. Figure 9 shows the speedup coefficients of the four proposed
algorithms on the Lena test image for the opening by reconstruction
$\rho(\gamma_{15}(f) \mid f)$. In this figure, we can note that algorithm 4
again shows the best behavior; its efficiency is very satisfactory since it
varies from 63% to 91%, whereas the efficiency of all other algorithms is
below 38% with 16 processors.

From all of these experiments, we can conclude that algorithm 4 is the best
suited for the implementation of connected operators based on the
reconstruction process. However, the experiments have shown a slight load
imbalance due to the distribution of grey-level values among processors.
Indeed, a processor that reconstructs a dark mask sub-image can receive
propagations from neighbor processors, which can trigger a large number of
re-computations, whereas a processor owning a bright sub-image sends many
propagations toward its neighbors but receives only a few, resulting in a
shorter execution time.

In order to solve this problem, work is in progress to balance the workload
among processors. The retained technique is based on the elastic load
balancing strategy proposed in [16] and on a progressive reconstruction.
First, we split the interval of grey-level values into an arbitrary number of
sub-intervals. Then, at any given time, each processor only reconstructs, in
its own sub-image, the pixels whose grey-level value belongs to the current
sub-interval. When all pixels of a given sub-interval are reconstructed, a
synchronization barrier is called in order to ensure that, at each time, all
processors reconstruct pixels with approximately the same grey-level value.
After this call, we change the current sub-interval and a dynamic load
balancing scheme is executed to distribute the same grey-level values to all
pixels. The remaining problem is to give a method (adaptive or empirical) to
split the interval of grey levels.

6  Conclusion and Perspectives

In this paper, we have presented several parallel algorithms for geodesic
reconstruction based on irregular data structures. We have shown that these
transformations present complex propagation and re-computation phenomena that
have been addressed in this paper by using an asynchronous approach as well as
a hierarchical data structure. The resulting algorithm shows a marked
irregularity but has proved its efficiency for all test images. Work is
currently in progress to solve the load imbalance problem.

Abstract. In the numerical simulation of crashworthiness the use of parallel
architectures is becoming more and more important. This stems from the desire
of engineers in the motorcar industry to get run times which make a dialogue
feasible. Parallel computation seems to be the only way to solve these
problems in an acceptable time. The computations inherent in crashworthiness
simulation can be divided into a contact and a non-contact part. In contrast
to the non-contact part, the contact part leads to an imbalance due to the
uneven (in space and time) distribution of contact. Good scalability becomes a
challenge. In this paper we present a dynamic load balancing strategy for
crash simulation. It keeps the contact and the non-contact part of the
computation separately balanced over the whole simulation time. Results of the
dynamic load balancing algorithm are discussed for a contact search algorithm
to which it is applied.


1  Introduction 

In the numerical simulation of crashworthiness the use of parallel architectures
is becoming more and more important. This stems from the desire of engineers
in the motorcar industry to get run times which make a dialogue feasible. Today,
finite element models for cars consist of approximately 250000 elements,
and more than 100000 time-steps are needed in explicit time-marching schemes.
Parallel computation seems to be the only way to solve these problems in an
acceptable time. For small parallel systems (fewer than 16 processors) with
shared memory the obtainable speedups are very satisfactory. But this performance
decreases significantly with increasing processor numbers. So for large problems
of crash simulation distributed-memory MIMD architectures are the better choice.
Standard domain partitioning tools (see [1] and [2]), which employ a recursive
spectral bisection algorithm and try to find connected parts of equal size for
each processor, generate partitions with a well-balanced workload for the finite
element part of the crash simulation algorithm. In general, however, the
contact-impact part of the computation is distributed very unevenly (in space
and time) by such a partition, which leads to an undesired load unbalance. Good
scalability becomes a challenge for crash simulation. Some improvements were
obtained by giving higher weights (in the partitioning tool) to elements with
expected contact and making blocks of the partition with high contact smaller [3].

To develop efficient parallel contact search algorithms the KALCRASH Project,
funded by the German Ministry for Research and Education (BMBF), was carried
out. The research led to a remarkable improvement concerning the contact
calculations [4]. This was valid for the sequential as well as the parallel case.
A part of the performance improvements was obtained by a static load balancing
strategy [5].

2  From  Static  to  Dynamic  Load  Balancing 

In the following discussion we distinguish between two parts of the program for
crash simulation, the contact part (CP) and the non-contact part (NCP). One
observation is that the two parts are connected by necessary communications.
Because communication between CP and NCP synchronizes the processors in each
simulation step, compensating a load unbalance in one part by a suitable load
unbalance in the other part is not very successful. Apart from communication,
the overall computation time is determined mainly by the sum of the maximal
computation times for each part over all processors.

These observations led to the static load balancing strategy of distributing
similar workloads to each processor for each part (CP and NCP). Instead of
directly using the partition of the domain partitioning tool, which optimises
the workload for NCP, a refined partition, i.e. one for a multiple of the number
of processors, was considered (over-partitioning). Then blocks of the refined
partition with high workload concerning CP are combined with blocks with low
workload concerning CP to form a block of a new partition for the given number
of processors. The blocks of the refined partition are now subblocks of the new
partition. This new partition fulfils the condition of similar workloads for
all processors for each part (CP and NCP). For example, a block of the refined
partition in the crashing zone (i.e. the front of the car) was combined with
another block of the refined partition in the rear of the car to build a block
of a new partition.
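
The pairing idea can be illustrated with a small sketch. It is not the procedure used in [5]; it assumes twofold over-partitioning, equal NCP cost for every refined block, and made-up CP workloads, and it simply pairs the refined block with the highest CP workload with the one with the lowest. The names `step_time` and `pair_blocks` are hypothetical.

```python
def step_time(cp_times, ncp_times):
    """Approximate time of one simulation step, communication ignored:
    the slowest processor in each part determines that part's duration."""
    return max(cp_times) + max(ncp_times)


def pair_blocks(cp_loads, n_proc):
    """Combine 2 * n_proc refined blocks into n_proc blocks by pairing the
    highest CP workload with the lowest, the second highest with the
    second lowest, and so on (an assumed heuristic, not the one in [5])."""
    if len(cp_loads) != 2 * n_proc:
        raise ValueError("this sketch assumes twofold over-partitioning")
    order = sorted(range(len(cp_loads)), key=lambda b: cp_loads[b])
    return [(order[i], order[-1 - i]) for i in range(n_proc)]


# Hypothetical CP workloads of 8 refined blocks (front of the car first).
cp_loads = [9.0, 8.0, 1.0, 0.5, 7.0, 0.5, 1.0, 2.0]
ncp_per_processor = [2.0] * 4  # every refined block costs 1.0 in the NCP

# Neighbouring refined blocks merged directly, as a partitioning tool would.
direct = [cp_loads[2 * i] + cp_loads[2 * i + 1] for i in range(4)]
# High-CP blocks combined with low-CP blocks.
blocks = pair_blocks(cp_loads, 4)
paired = [cp_loads[a] + cp_loads[b] for a, b in blocks]

print(step_time(direct, ncp_per_processor))
print(step_time(paired, ncp_per_processor))
```

In this toy setting the NCP time is the same for both partitions, so the whole difference in step time comes from the smaller maximal CP workload.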

A drawback of this approach is that a block generally consists of several
separate subblocks, which leads to some communication overhead in the finite
element part of the crash simulation. This effect is investigated in the running
EU Project SIM-VR where the discussed contact search algorithm (CSA) was
successfully integrated into the crash simulation code PAM-CRASH of ESI. First
results show that the gain in the contact computations by far outweighs the loss
in the non-contact part.

Another drawback of the static load balancing approach is that some skill and
time is needed to put the right subblocks together to obtain a good partition.
A good partition is one which fulfils the condition of similarly low workloads
for all processors for CP and NCP on average over the whole run time. Since the
workload concerning CP changes, a dynamic load balancing strategy based on the
static approach is to perform an exchange of subblocks from time to time
automatically. Once a good partition is found it remains a fairly good partition
for some time, since the model itself in general changes only slowly (with
respect to time-steps). With regard to its granularity, this dynamic load
balancing approach lies between the static approach on the one extreme and
repartitioning at every time-step using connected and compact blocks [6] on the
other extreme. This work and the integration of CSA with dynamic load balancing
into PAM-CRASH is now to be performed in the BMBF-Project AUTOBENCH.


3  Dynamic  Load  Balancing  by  Over-Partitioning 

Starting with an arbitrary refined partition (the number of blocks is a multiple
of the number of processors) given by a domain partitioning tool, each processor
is associated with the same number of subblocks. This 'guarantees' a good load
balance for the non-contact part. Since the time-steps in the simulation are
largely synchronized by the communication structure, and since by construction
we get a good load balance for the non-contact part, it remains to achieve good
load balance in the contact part. This is done by an exchange of subblocks.


3.1  When  to  make  Load  Balancing  Steps 

With a given partition the simulation is performed for a certain number (nstep,
see Table 1) of time-steps, but in accordance with the multi-level contact search
algorithm [5] (i.e. outside the fast loop). At this point each processor measures
the time it needs for certain tasks which are representative of the workload of
the contact part. These tasks include the creation of lists of neighbouring but
nonconnected nodes for each slave node which is admissible for contact, and the
measuring of distances of elements associated with these nodes to the slave nodes.
These times are made available to all processors in an all-to-all communication.
If these times differ by no more than the special threshold parameter exstop (see
Table 1), no load balancing is performed and the simulation is continued for the
next nstep time-steps, after which the same procedure is repeated. If they differ
by more than exstop, a load balancing step is initiated.
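
The decision rule can be sketched as follows. The text only states that exstop bounds the difference between the measured times and that it is given in percent (Table 1); measuring the spread relative to the slowest processor is an assumption made here for illustration, and the function name `needs_rebalance` is hypothetical.

```python
def needs_rebalance(times, exstop_percent):
    """Return True if the contact-part times measured on all processors
    differ by more than exstop_percent.

    Assumption: the difference is taken between the slowest and the
    fastest processor and expressed relative to the slowest one.
    """
    slowest = max(times)
    fastest = min(times)
    if slowest <= 0.0:
        return False  # nothing measured, nothing to balance
    spread = 100.0 * (slowest - fastest) / slowest
    return spread > exstop_percent


# Hypothetical measurements (seconds) gathered from four processors.
print(needs_rebalance([10.0, 9.5, 9.8, 9.2], 10))  # spread of about 8 %
print(needs_rebalance([10.0, 6.0, 9.0, 8.0], 10))  # spread of 40 %
```

Because every processor receives all times in the all-to-all communication, each one can evaluate this test locally and reach the same decision.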

3.2  How  to  make  Load  Balancing  Steps 

To combine the right subblocks into one block, the workload for the contact part
of each subblock has to be determined. Since these subblocks are not treated
separately in CSA, the workload caused by them cannot be measured directly. It is
measured indirectly through the number of nodes and proximity-pairs (a slave
node and a nearby nonconnected element) which have been detected by CSA
inside the corresponding subblocks. A good estimation model for the workload
was achieved by a regression of the time spent for certain contact search
computations inside a block on the size of these sets of nodes and
proximity-pairs. These sets represent the hierarchical strategy of CSA:
elimination of nodes and elements from the current contact search as early and
cheaply as possible in order to minimize computation and communication.
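
A regression of this kind can be sketched as below. The form of the model is not given in the text; the sketch assumes a linear model without intercept, time ≈ a · (number of nodes) + b · (number of proximity-pairs), fitted by least squares. The function names and the sample data are hypothetical.

```python
def fit_workload_model(samples):
    """Least-squares fit of time = a * n_nodes + b * n_pairs.

    samples: list of (n_nodes, n_pairs, seconds) measured per block.
    The two-parameter normal equations are solved with Cramer's rule.
    """
    snn = sum(n * n for n, _, _ in samples)
    snp = sum(n * p for n, p, _ in samples)
    spp = sum(p * p for _, p, _ in samples)
    snt = sum(n * t for n, _, t in samples)
    spt = sum(p * t for _, p, t in samples)
    det = snn * spp - snp * snp
    if det == 0:
        raise ValueError("samples do not determine both coefficients")
    a = (snt * spp - spt * snp) / det
    b = (spt * snn - snt * snp) / det
    return a, b


def estimate_workload(a, b, n_nodes, n_pairs):
    """Estimated contact workload of one subblock from its set sizes."""
    return a * n_nodes + b * n_pairs


# Hypothetical block measurements: (nodes, proximity-pairs, seconds).
samples = [(1000, 50, 2.5), (2000, 10, 4.1), (500, 200, 3.0)]
a, b = fit_workload_model(samples)
print(estimate_workload(a, b, 800, 120))
```

Once the coefficients are fitted on whole blocks, the same formula can be applied to the set sizes counted inside each subblock.
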

The estimated workload of contact computation for a processor is given by the
sum of the estimated workloads of the subblocks which form a block. Improvement
is assumed if the maximal estimated workload over all processors can be reduced
through a recombination of subblocks. If improvement is possible, the
corresponding processors exchange suitable subblocks. A typical block for an
eight-processor run formed by a recursive spectral bisectioning tool is shown by
the dark region in Fig. 1. The block is connected and compact. All of it lies in
a zone with likely contact. This induces load unbalance in the contact part of
the simulation. After an exchange of subblocks the block associated with the same


Fig. 1. Block of a BMW model from standard partitioning for 8 processors.


processor has changed into one with two connectivity components (see Fig. 2).
The part in the front of the car, where the most deformation takes place, has
intensive contact whereas the part in the rear of the car has almost no contact.
Including communication, the new block has less workload for the contact part
of the simulation than the old one.

The new algorithm performs this operation automatically at certain instants
(when the observed load unbalance exceeds the given threshold parameter exstop)
to minimize the maximal workload over all processors. This operation of course
introduces additional overhead, but this is controlled sufficiently by the
above-mentioned threshold parameters exstop and nstep.
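
One possible exchange rule is sketched below. The text does not specify how the subblocks to exchange are chosen; this sketch assumes a greedy search that swaps one subblock of the most loaded processor against one subblock of another processor whenever this lowers the maximal estimated workload. Swapping one for one keeps the number of subblocks per processor constant, which preserves the balance of the non-contact part. All names and workloads are hypothetical.

```python
def best_swap(assign, est):
    """Find the single swap between the most loaded processor and any other
    processor that gives the lowest maximal estimated workload.
    Returns (new_max, hi, s_hi, lo, s_lo) or None if no swap improves."""
    loads = [sum(est[s] for s in blocks) for blocks in assign]
    hi = max(range(len(loads)), key=lambda p: loads[p])
    best = None
    for lo in range(len(assign)):
        if lo == hi:
            continue
        for s_hi in assign[hi]:
            for s_lo in assign[lo]:
                delta = est[s_hi] - est[s_lo]
                if delta <= 0:
                    continue  # would not relieve the most loaded processor
                new_loads = list(loads)
                new_loads[hi] -= delta
                new_loads[lo] += delta
                new_max = max(new_loads)
                if new_max < loads[hi] and (best is None or new_max < best[0]):
                    best = (new_max, hi, s_hi, lo, s_lo)
    return best


def rebalance(assign, est, max_steps=100):
    """Apply improving swaps until none is left (or max_steps is reached)."""
    assign = [list(blocks) for blocks in assign]
    for _ in range(max_steps):
        swap = best_swap(assign, est)
        if swap is None:
            break
        _, hi, s_hi, lo, s_lo = swap
        assign[hi][assign[hi].index(s_hi)] = s_lo
        assign[lo][assign[lo].index(s_lo)] = s_hi
    return assign


# Hypothetical estimated contact workloads of 8 subblocks on 4 processors.
est = {0: 9, 1: 8, 2: 1, 3: 1, 4: 7, 5: 1, 6: 1, 7: 2}
start = [[0, 1], [2, 3], [4, 5], [6, 7]]
balanced = rebalance(start, est)
print([sum(est[s] for s in blocks) for blocks in balanced])
```

A greedy search of this kind can stop in a local optimum; it is meant only to show how the estimated workloads drive the exchange.
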
Fig. 2. Block of a BMW model from over-partitioning for 8 processors.


4  Performance  Results 

First performance results of the dynamic load balancing strategy have been
obtained for the contact search program CSA. CSA is implemented in Fortran 77
with the message-passing interface MPI. Since CSA with dynamic load balancing
is not yet integrated into PAM-CRASH we use interpolated data as in [5].
They are based on a 40% off-set crash simulation of a BMW benchmark model
with around 60000 elements.

Simulations of 40000 time-steps on an IBM SP2 are considered for 8, 16 and 32
processors. The dynamic load balancing partitions are created from a fourfold
over-partitioning, i.e. each processor has four subblocks (see Table 1). There
are two 'start' partitions for the dynamic case. The first are the ones given by
the partitioning tool directly (the four subblocks created through bisection in
the last two steps of the partitioning tool are combined to one block) and the
second are the static load balancing partitions created from a fourfold
over-partitioning which were found to be very good in the static load balancing
approach [5]. Comparing the run times for contact search without dynamic load
balancing for these two partitions (labeled 'without' and 'static' in Table 1),
we see that in the case of 8 processors the time is halved, and in the case of
16 and 32 processors an improvement of 36% and 40%, respectively, was obtained.
The improvement is measured against run times with no load balancing.
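
The improvement figures can be checked with a few lines. The formula below, the relative reduction of the run time with respect to the run without load balancing, rounded to whole percent, is inferred from the description above; it reproduces the 'static' rows of Table 1. The function name `improvement` is hypothetical.

```python
def improvement(time_sec, baseline_sec):
    """Relative run time reduction in percent, rounded to an integer."""
    return round(100.0 * (1.0 - time_sec / baseline_sec))


# 'static' versus 'without' run times (seconds) from Table 1.
print(improvement(1269, 2615))  # 8 processors
print(improvement(1176, 1829))  # 16 processors
print(improvement(685, 1134))   # 32 processors
```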

Considering now the dynamic load balancing results, we have varied the
parameters nstep (number of time-steps until the next workload monitoring) and
exstop (maximal difference in measured workload over all processors). The
dynamic load balancing starts from the partition without load balancing and from

Table 1. Dynamic load balancing results for 40000 time-steps with fourfold
over-partitioning.

number   number        load       nstep  exstop  time   improvement
of proc  of subblocks  balancing         (%)     (sec)  (%)

8        8             without    -      -       2615   0
8        32            static     -      -       1269   51
8        32            dynamic    500    10      1445   45
8        32            dynamic    1000   5       1449   45
8        32            dynamic    1000   10      1309   50
8        32            dynamic    1000   15      1375   47
8        32            dynamic    2000   10      1386   47
8        32            stat+dyn   1000   5       1347   48
8        32            stat+dyn   1000   10      1295   50
8        32            stat+dyn   1000   15      1321   49

16       16            without    -      -       1829   0
16       64            static     -      -       1176   36
16       64            dynamic    500    15      958    48
16       64            dynamic    1000   10      949    48
16       64            dynamic    1000   15      941    48
16       64            dynamic    1000   20      933    49
16       64            dynamic    2000   15      945    48
16       64            stat+dyn   1000   10      929    49
16       64            stat+dyn   1000   15      903    51
16       64            stat+dyn   2000   15      888    51
16       64            stat+dyn   2000   15      874    52

32       32            without    -      -       1134   0
32       128           static     -      -       685    40
32       128           dynamic    500    15      843    26
32       128           dynamic    1000   10      824    28
32       128           dynamic    1000   15      778    31
32       128           dynamic    1000   20      784    31
32       128           dynamic    2000   15      776    32
32       128           stat+dyn   1000   10      750    34
32       128           stat+dyn   1000   15      725    36
32       128           stat+dyn   1000   20      768    32
32       128           stat+dyn   2000   15      721    36

the static load balancing partitions ('stat+dyn', see Table 1). The computing
times for the latter case are slightly better because of the better starting
partition. For example, for 32 processors the improvement rose from 32% to 36%.
It can be seen that the times for dynamic load balancing are similar to the
times in the static load balancing case. In the case of 16 processors the
performance was better. The exchanges of subblocks become less frequent while
the simulation is going on. Altogether, between 50 and 100 exchanges within the
first 40000 time-steps were observed for the dynamic cases with 32 processors.

The two parameters nstep and exstop have to be adapted to the crash model.
They show that one should not try too few or too many times to find a better
partition and that the difference in workload between the processors must be
worthwhile. In Table 2 the partitions are created from two- or eight-fold
over-partitioning. While the twofold over-partitioning leads to big subblocks in
the front with intensive contact, the eight-fold over-partitioning leads to
complex blocks with more communication (more neighbours) (see Table 2) but
better adjustments. Since the eight-fold over-partitioning implies the fourfold
case (the refined partition was created by a bisectioning procedure), a more
elaborately designed objective function incorporating communication requirements
should lead to some further improvement. It seems that decreasing the
granularity further leads to more (intractable) overhead (see Table 2,
8 processor case with 128 subblocks).


Table 2. Dynamic load balancing results for 40000 time-steps with two-, eight- and
sixteen-fold over-partitioning.

number   number        load       nstep  exstop  time   improvement
of proc  of subblocks  balancing         (%)     (sec)  (%)

8        16            dynamic    1000   15      2004   23
8        32            dynamic    1000   15      1375   47
8        64            dynamic    1000   15      1275   51
8        128           dynamic    1000   15      1428   45

16       32            dynamic    1000   15      1306   29
16       64            dynamic    1000   15      941    48
16       128           dynamic    1000   15      900    51

32       64            dynamic    1000   15      899    21
32       128           dynamic    1000   15      778    31


5  Concluding Remarks

Future work, especially the integration of CSA with dynamic load balancing into
PAM-CRASH, will be done in the project AUTOBENCH. One topic will be
the integration of communication into the objective function for determining
the subblocks to exchange. To find good partitions, the number of neighbouring
blocks in a partition and the number of boundary elements of these blocks will
also be considered. For instance, the rule that neighbours in the front should
also be neighbours in the rear seems to be a good strategy since it tends to
lead to less communication.

Abstract. This paper presents a methodology for parallelization of the
power systems composite reliability evaluation using Monte Carlo
simulation. A coarse grain parallelization strategy is adopted, where the
adequacy analyses of the sampled system states are distributed among the
processors. An asynchronous parallel algorithm is described and tested
on five different electric systems. The paper presents results of almost
linear speedup and high efficiency obtained on a four-node IBM RS/6000
SP distributed memory parallel computer.


1  Introduction 

The power generation, transmission and distribution systems constitute a basic
element in the economic and social development of modern societies. For
technical and economic reasons, these systems have evolved from a group of small
and isolated systems to large and complex interconnected systems with national
or even continental dimensions. For this reason, failure of certain components
of the system can produce disturbances capable of affecting a great number of
consumers. On the other hand, due to the sophistication of the electric and
electronic equipment used by consumers, the demand for power supply reliability
has been increasing considerably. More recently, institutional changes in the
electric energy sector, such as those originated by deregulation policies,
privatizations, environmental restrictions, etc., have been forcing the
operation of such systems closer to their limits, increasing the need to
evaluate the power supply interruption risks and quality degradation more
precisely.

Probabilistic models have been widely used in the evaluation of power
systems performance. Based on information about component failures, these
models make it possible to establish system performance indexes which can be
used to aid decision making relative to new investments and operative policies,
and to evaluate transactions in the electric energy market. This type of study
receives the generic name of Reliability Evaluation [1] and can be accomplished
at the generation, transmission and distribution levels or by combining several
levels. The composite system reliability evaluation, the object of this work,
refers to the reliability evaluation of power systems composed of generation and
transmission sub-systems.

The basic objective of the composite generation and transmission system
reliability evaluation is to assess the capacity of the system to satisfy the
power demand at its main points of consumption. For this purpose, the
possibility of failures in both generation and transmission system components
is considered and the impact of these failures on the power supply is evaluated.
There are two possible approaches for reliability evaluation: analytic
techniques and stochastic simulation (Monte Carlo simulation [2,3], for
example). In the case of large systems, with complex operating conditions and a
high number of severe events, Monte Carlo simulation is generally preferable due
to the ease of modeling complex phenomena.

For  the  reliability  evaluation  based  on  Monte  Carlo  simulation,  it  is  necessary 
to  analyze  a  very  large  number  of  system  operating  states.  This  analysis,  in 
general,  includes  load  flow  calculations,  static  contingencies  analysis,  generation 
re-scheduling,  load  shedding,  etc.  In  several  cases,  this  analysis  must  be  repeated 
for  several  different  load  levels  and  network  topology  scenarios. 

The reliability evaluation of large power systems may demand hours of
computation on high performance workstations. The majority of the computational
effort is concentrated in the system states analysis phase. This analysis may
be performed independently for each system state. The combination of elevated
processing requirements with the concurrent events characteristic suggests the
application of parallel processing for the reduction of the total computation
time.

This paper describes some results obtained through a methodology under
development for power system composite reliability evaluation on parallel
computers with distributed memory architecture and communication via message
passing. The methodology is being developed using as reference element a
sequential program for reliability evaluation used by the Brazilian electric
energy utilities [4].

2  Composite  Reliability  Evaluation 

The composite generation and transmission system reliability evaluation consists
of the calculation of several performance indexes, such as the Loss of Load
Probability (LOLP), Expected Power Not Supplied (EPNS), Loss of Load Frequency
(LOLF), etc., using a stochastic model for the electric system operation. The
conceptual algorithm for this evaluation is as follows:

1. Select an operating scenario x corresponding to a load level, components
availability, operating conditions, etc.

2. Calculate the value of an evaluation function F(x) which quantifies the
effect of violations in the operating limits in this specific scenario.
Corrective actions such as generation rescheduling, load shedding minimization,
etc., can be included in this evaluation.

3. Update the expected value of the reliability indexes based on the result
obtained in step 2.

4. If the accuracy of the estimates is acceptable, terminate the process.
Otherwise, return to step 1.

Consider a system with n components (transmission lines, transformers,
electric loads, etc.). An operating scenario for this system is given by the
random vector:

$$ x = (x_1, x_2, \ldots, x_n) , \tag{1} $$

where $x_i$ is the state of the i-th component. The group of all possible
operating states, obtained by all the possible combinations of component states,
is called the state space and is represented by $X$. In Monte Carlo simulation,
step 1 of the previous algorithm consists of obtaining a sample of the vector
$x \in X$ by sampling the probability distributions of the random variables
corresponding to the component operating states, using a random number
generator algorithm.

In step 2 of the previous algorithm, it is necessary to simulate the operating
condition of the system in the respective sampled states, in order to determine
whether the demand can be satisfied without violating operating restrictions.
This simulation demands the solution of a contingency analysis problem [5] and,
possibly, of an optimal load flow problem [6] to simulate the generation
re-scheduling and the minimum load shedding. In the case of large electric
systems, these simulations require high computational effort in relation to
that necessary for the other steps of the algorithm [7].

The reliability indexes calculated at step 3 correspond to estimates of the
expectation of different evaluation functions F(x), obtained for N system state
samples by:

$$ E[F] = \frac{1}{N} \sum_{k=1}^{N} F(x^{k}) , \tag{2} $$

where $x^{k}$ denotes the k-th sampled system state.

The Monte Carlo simulation accuracy may be expressed by the coefficient
of variation $\alpha$, which is a measure of the uncertainty around the
estimate, and is defined as:

$$ \alpha = \frac{\sqrt{V(E[F])}}{E[F]} . \tag{3} $$



The  convergence  criterion  usually  adopted  is  the  coefficient  of  variation  of 
the  EPNS  index,  which  has  the  worst  convergence  rate  of  all  reliability  indexes. 
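
The four steps and the stopping rule can be sketched on a toy system. This is only an illustration under stated assumptions, not the program of [4]: the components are independent two-state generating units, the evaluation function is the power not supplied without any network model, the variance of the estimate is taken as the sample variance of F divided by N (valid for independent samples), and a minimum number of samples guards against stopping by chance. All names and data are hypothetical.

```python
import math
import random


def sample_state(unavailability, rng):
    """Step 1: draw one system state; True means the component is available."""
    return [rng.random() >= q for q in unavailability]


def power_not_supplied(state, capacity, demand):
    """Step 2 (toy evaluation function): demand that cannot be served."""
    available = sum(c for up, c in zip(state, capacity) if up)
    return max(0.0, demand - available)


def estimate_epns(unavailability, capacity, demand,
                  tol=0.05, min_samples=500, max_samples=200000, seed=1):
    """Estimate the EPNS until its coefficient of variation is below tol."""
    rng = random.Random(seed)
    n, total, total_sq = 0, 0.0, 0.0
    mean, alpha = 0.0, float("inf")
    while n < max_samples:
        f = power_not_supplied(sample_state(unavailability, rng),
                               capacity, demand)
        n += 1
        total += f
        total_sq += f * f
        # Step 3: update the estimate and its coefficient of variation.
        mean = total / n
        if n > 1 and mean > 0.0:
            var_f = max((total_sq - n * mean * mean) / (n - 1), 0.0)
            alpha = math.sqrt(var_f / n) / mean
        # Step 4: stop once the estimate is accurate enough.
        if n >= min_samples and alpha <= tol:
            break
    return mean, alpha, n


# Three 100 MW units, each unavailable 10 % of the time, 250 MW demand.
mean, alpha, n = estimate_epns([0.1] * 3, [100.0] * 3, 250.0)
print(mean, alpha, n)
```

For this toy system the exact expected power not supplied is 16.45 MW, which the estimate should approach as the tolerance is tightened.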


3  Parallelization  Strategy 

The composite reliability evaluation problem can be summarized in three main
functions: the system states selection, the adequacy analysis of the selected
system states and the reliability indexes calculation. As described before, an
adequacy analysis is performed for each state selected by sampling in the Monte
Carlo simulation, i.e., the system capacity to satisfy the demand without
violating operation and security limits is verified.

The algorithm is inherently parallel with a high degree of task decoupling.
The system states analyses are completely independent from each other and,
in a coarse grain parallelization strategy, it is only necessary to communicate
in three different situations:

1.  For  the  initial  distribution  of  data,  identical  for  all  processors  and  executed 
once  during  the  whole  computation; 

2.  For the final grouping of the partial results calculated in each processor,
also executed only once; and

3.  For  control  of  the  global  parallel  convergence,  which  needs  to  be  executed 
several  times  during  the  simulation  process,  with  frequency  that  obeys  some 
convergence  control  criteria. 

The basic configuration for parallelization of this problem is the master-slaves
paradigm, in which a process, denominated master, is responsible for acquiring
the data, distributing them to the slaves, controlling the global convergence,
receiving the partial results from each slave, calculating the reliability
indexes and generating reports. The slave processes are responsible for
analyzing the system states allocated to them, sending their local convergence
data to the master and, at the end of the iterative process, also sending their
partial results. It is important to point out that, in architectures where the
processors have equivalent processing capacities, the master process should also
analyze system states in order to improve the algorithm performance. For
purposes of this work, each process is allocated to a different processor and is
referred to simply as processor from now on.

The  main  points  that  had  to  be  solved  in  the  algorithm  parallelization  were 
the  system  states  distribution  philosophy  and  the  parallel  convergence  control 
criteria,  in  order  to  have  a  good  load  balancing.  These  questions  are  dealt  with 
in  the  next  subsections. 


3.1  System States Distribution Philosophy

The most important problem to be treated in parallel implementations of Monte
Carlo simulation is to avoid correlation between the sequences of random numbers
generated in different processors. If the sequences are correlated in some way,
the information produced by different processors will be redundant and will not
contribute to increasing the statistical accuracy of the computation, degrading
the algorithm performance. In some Monte Carlo applications, correlation
introduces interferences that produce incorrect results. Initializing the random
number generator with different seeds for each processor is not a good practice,
because it can generate sequences that are correlated to each other [9].

In the system states distribution philosophy adopted, the system states are
generated directly at the processors in which they will be analyzed. For this
purpose, all processors receive the same seed and execute the same random number
sampling, generating the same system states. Each processor, however, starts to
analyze the state with a number equal to its rank in the parallel computation
and analyzes the next states using as step the number of processors involved
in the computation. Supposing that the number of available processors is 4,
then processor 1 analyzes states numbered 1, 5, 9, ..., processor 2 analyzes
states numbered 2, 6, 10, ..., and so on.
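
A minimal sketch of this distribution is given below. It uses Python's `random.Random` as a stand-in for the random number generator of the reliability program, a single random number as a stand-in for a full system state, and 1-based ranks as in the example above; the function name and the seed value are hypothetical.

```python
import random


def states_for_processor(rank, n_proc, n_states, seed=12345):
    """Return the (number, state) pairs analyzed by one processor.

    Every processor uses the same seed and samples every state, but keeps
    only the states numbered rank, rank + n_proc, rank + 2 * n_proc, ...
    """
    rng = random.Random(seed)  # identical seed on all processors
    mine = []
    for number in range(1, n_states + 1):
        state = rng.random()  # placeholder for sampling a system state
        if (number - rank) % n_proc == 0:
            mine.append((number, state))
    return mine


for rank in range(1, 5):
    print(rank, [number for number, _ in states_for_processor(rank, 4, 12)])
```

Since all processors walk through the same single sequence, the states they analyze are exactly those of a sequential run, and no state is analyzed twice. The price is that each processor also samples the states it does not analyze.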


3.2  Parallel  Convergence  Control 

Three different parallelization strategies for the composite reliability
evaluation problem were tried [10] with variations over the task allocation
graph, the load distribution criteria and the convergence control method. The
strategy that produces the best results in terms of speedup and scalability is
an asynchronous one that will be described in detail in this section. The task
allocation graph for this asynchronous parallel strategy is shown in Fig. 1,
where p is the number of scheduled processors. Each processor has a rank in the
parallel computation which varies from 0 to (p-1), 0 referring to the master
process. The basic tasks involved in the solution of this problem can be
classified in five types: I - Initialization, A - States Analysis, C -
Convergence Control, P - Iterative Process Termination and F - Finalization
(calculation of reliability indexes and generation of reports). A subindex i
associated with task T means it is allocated to processor i. A superindex j
associated with T_i means the j-th execution of task T by processor i.


Fig.  1.  Task  Allocation  Graph 


The problem initialization, which consists of the data acquisition and some
initial computation, is executed by the master processor, followed by a
broadcast of the data to all slaves. All processors, including the master, then
pass to the states analysis phase, each one analyzing different states. After a
time interval Δt, each slave sends to the master the data relative to its own
local convergence, independently of how many states it has analyzed so far, and
then continues to analyze other states. At the end of another Δt time interval,
a new message is sent to the master, and this process is repeated periodically.

When the master receives a message from a slave, it verifies the status of
the global parallel convergence. If it has not been achieved, the master goes
back to the system states analysis task until a new message arrives and the
parallel convergence needs to be checked again. When the parallel convergence
is detected or the maximum number of state samples is reached, the master
broadcasts a message telling the slaves to terminate the iterative process,
upon which the slaves send their partial results back to the master. The master
then calculates the reliability indexes, generates the reports and terminates.
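
The master's check can be sketched as below. It is an illustration only: the
message layout (count, sum and sum of squares of the index being estimated) and
the names `merge` and `converged` are assumptions, not the paper's data
structures. The tolerance is applied to the coefficient of variation of the
estimate, which is the usual stopping rule in Monte Carlo reliability
evaluation.

```python
import math

def merge(partials):
    """Combine per-processor (count, sum, sum_of_squares) tuples."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    return n, s, sq

def converged(partials, tolerance=0.05, max_samples=10**6):
    """Return True when the master should tell the slaves to stop."""
    n, s, sq = merge(partials)
    if n >= max_samples:
        return True                      # sample limit reached
    if n < 2 or s == 0:
        return False                     # not enough information yet
    mean = s / n
    variance = max(sq / n - mean * mean, 0.0)        # variance of the samples
    coeff_var = math.sqrt(variance / n) / abs(mean)  # standard error / mean
    return coeff_var <= tolerance

# Latest partial results held by the master and reported by two slaves
# (hypothetical numbers).
partials = [(420, 84.0, 42.0), (400, 80.0, 40.0), (350, 70.0, 35.0)]
print(converged(partials))
```

Because only sums are exchanged, the master can re-evaluate the criterion each
time a message arrives without waiting for the other slaves.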

In this parallelization strategy, there is no synchronization during the
iterative process, and the load balancing is determined by the processors'
capacities and the complexity of the system states. The precedence graph [11]
is shown in Fig. 2, where the horizontal lines are the local time axes of the
processors. Only the master and one slave are represented in this figure. The
horizontal arcs represent the successive states through which a processor
passes. The vertices represent the message sending and receiving events. The
messages are represented by the arcs linking horizontal lines associated with
different processors.


Fig.  2.  Precedence  Graph 


An additional consideration introduced by the parallelization strategy is the
redundant simulation. During the last stage of the iterative process, the
slaves execute some analyses beyond the minimum necessary to reach convergence.
Convergence is detected based on the information contained in the last message
sent by the slaves and, between the sending of that message and the reception
of the message announcing convergence, the slaves keep analyzing more states.
This, however, does not imply any loss of time for the computation as a whole,
since no processor is idle at any time. The redundant simulation is used in the
final calculation of the reliability indexes, generating indexes even more
accurate than those of the sequential solution, since in Monte Carlo methods
the uncertainty of the estimate decreases as the number of analyzed state
samples grows.
4  Implementation

4.1  Message  Passing  Interface  (MPI) 

MPI  is  a  standard  and  portable  message  passing  system  designed  to  operate 
on  a  wide  variety  of  parallel  computers  [12].  The  standard  defines  the  syntax 
and  semantics  of  the  group  of  subroutines  that  integrate  the  library.  The  main 
goals  of  MPI  are  portability  and  efficiency.  Several  efficient  implementations  of 
MPI  already  exist  for  different  computer  architectures.  The  MPI  implementation 
used  in  this  work  is  the  one  developed  by  IBM  for  AIX  that  complies  with  MPI 
standard  version  1.1. 

Message  passing  is  a  programming  paradigm  broadly  used  in  parallel  comput¬ 
ers,  especially  scalable  parallel  computers  with  distributed  memory  and  networks 
of  workstations  (NOWs).  The  basic  communication  mechanism  is  the  transmittal 
of  data  between  a  pair  of  processes,  one  sending  and  the  other  receiving.  There 
are two types of communication functions in MPI: blocking and non-blocking. In
this work, the non-blocking send and receive functions are used, which allows
message transmission to overlap with computation and tends to improve
performance.

Other  important  aspects  of  communication  are  related  to  the  semantics  of  the 
communication  primitives  and  the  underlying  protocols  that  implement  them. 
MPI offers four modes for point-to-point communication, which allow the
programmer to choose the semantics of the send operation and to influence the
data transfer protocol. In this work, the Standard mode is used, in which it is
up to MPI to decide whether outgoing messages are buffered, based on buffer
space availability and performance considerations.


4.2  Parallel  Computer

The IBM RS/6000 SP Scalable POWERparallel System [13] is a scalable parallel
computer with distributed memory. Each node of this parallel machine is a
complete workstation with its own CPU, memory, hard disk and network interface.
The processor architecture may be POWER2/PowerPC 604 or Symmetric
Multiprocessor (SMP) PowerPC. The nodes are interconnected by a high
performance switch dedicated exclusively to the execution of parallel programs.
This switch can establish a direct connection between any pair of nodes, and
one full-duplex connection can exist simultaneously for each node.

The parallel platform on which this work was implemented is an IBM RS/6000
SP with 4 POWER2 processors interconnected by a high performance switch with
40 MB/s full-duplex bandwidth. Although the peak floating-point performance is
266 MFLOPS for all nodes of this machine, the nodes differ in processor type,
memory and processor buses, cache and RAM, and these differences can be more or
less significant depending on the characteristics of the program being
executed.
5  Results

5.1  Test  Systems 

Five different electric systems were used to verify the performance and
scalability of the parallel implementation. The first one is the IEEE-RTS, the
standard test system for reliability evaluation [14]. The second one,
CIGRE-NBS, is a representation of the New Brunswick Power System proposed by
CIGRE as a standard for comparing reliability evaluation methodologies [15].
The third, fourth and fifth are representations of the actual Brazilian power
system, with actual electric characteristics and dimensions, for the
North-Northeastern (NNE), Southern (SUL) and Southeastern (SE) regions,
respectively. The main data for the test systems are shown in Table 1. A
convergence tolerance of 5% in the EPNS index was adopted for the reliability
evaluation of all test systems.


Table  1.  Test  Systems 


System   Nodes   Circuits   Areas
RTS         24         38       2
NBS         89        126       4
NNE         89        170       6
SUL        660       1072      18
SE        1389       2295      48


5.2  Results  Analysis 

The  speedups  and  efficiencies  achieved  by  the  parallel  code  for  the  five  test 
systems  on  2,  3  and  4  processors,  together  with  the  CPU  time  of  the  sequential 
code  (1  processor),  are  summarized  in  Table  2. 


Table  2.  Results 


System   CPU time (p=1)   Speedup               Efficiency (%)
                          p=2    p=3    p=4     p=2     p=3     p=4
RTS      35.03 sec        1.87   ...    ...     ...     ...     ...
NBS      24.36 min        ...    ...    ...     ...     97.90   97.28
NNE      ...              ...    ...    ...     ...     ...     ...
SUL      ...              ...    ...    ...     ...     ...     96.28
SE       8.52 hour        ...    ...    ...     ...     ...     ...


As can be seen, the results obtained are very good, with almost linear speedup
and efficiency close to 100% for the larger systems. The asynchronous parallel
implementation provides practically ideal load balancing and negligible
synchronization time. The communication time for broadcasting the initial data
and gathering the partial results is negligible compared to the total
simulation time. The communication time that has a significant effect on
parallel performance is the time consumed in exchanging messages for
controlling the parallel convergence. The time spent by the processors sampling
states that they do not analyze, due to the states distribution philosophy
adopted, is also negligible compared to the total computation time.
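
For reference, the two metrics discussed here follow the usual definitions:
speedup is the sequential time divided by the parallel time, and efficiency is
the speedup divided by the number of processors. A minimal sketch, in which the
function names and the timings are hypothetical and not taken from Table 2:

```python
def speedup(t_seq, t_par):
    """Speedup S = T1 / Tp."""
    return t_seq / t_par

def efficiency(t_seq, t_par, num_procs):
    """Efficiency E = S / p, returned as a percentage."""
    return 100.0 * speedup(t_seq, t_par) / num_procs

# Hypothetical timings in seconds for 1 and 4 processors.
t1, t4 = 1000.0, 260.0
print(round(speedup(t1, t4), 2), round(efficiency(t1, t4, 4), 1))
```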

The algorithm also showed good scalability in terms of the number of
processors and the dimension of the test systems, with almost constant
efficiency for different numbers of processors. This good behavior of the
parallel solution is due to the combination of three main aspects:

1.  The  high  degree  of  parallelism  inherent  to  the  problem, 

2.  The  coarse  grain  parallelization  strategy  adopted  and 

3.  The  asynchronous  implementation  developed. 

Figure  3  shows  the  speedup  curve  for  the  RTS  and  NBS  test  systems  and 
Fig.  4  for  the  three  actual  Brazilian  systems. 


Fig.  3.  Speedups  -  RTS  and  NBS


Fig.  4.  Speedups  -  NNE,  SUL  and  SE


A comparison of the main indexes calculated in the sequential and parallel
(4 processors) implementations can be made from Table 3, where the Expected
Power Not Supplied (EPNS) is given in MW and the Loss of Load Frequency (LOLF)
is given in occurrences/year.


Table  3.  Reliability  Indexes 


System 

LOLP 

EPNS 

LOLF 

«(%)  i 

ESI 

:|SI 

ISl 

P=1 

ISI 

E5II 

RTS 

fiign 

Biro 

Haro 

EBI 

NBS 

BWiin 

QQQ 

QQI 

QQ 

ibiai 

10.88 

EBI 

NNE 

BligH 

Hill 

23.97 

24.23 

EQQQ] 

EKMill 

SUL 

69.76 

69.61 

SE 

IBu 

114.22 
-  ..  .... 

114.56 

As can be seen, the results obtained are statistically equivalent, with the
parallel results slightly more accurate (smaller coefficient of variation α).
This is due to the convergence detection criteria. In the sequential
implementation, convergence is checked at each new state analyzed. In the
parallel implementation, doing this makes no sense, since it would imply a very
large number of messages between processors. The parallel convergence is
checked in chunks of states analyzed at different processors, as described
before, which may lead to a different total number of analyzed states when
convergence is detected. Another factor that contributes to the greater
accuracy of the parallel solution is the redundant simulation described
earlier.

Another interesting observation is that the convergence path of the parallel
implementation is not necessarily the same as that of the sequential
implementation. The state adequacy analyses may demand different computation
times depending on the complexity of the states being analyzed. Besides, faster
processors may analyze more states than slower ones in the same time interval.
The combination of these two characteristics may lead to a different sequence
of states analyzed by the overall simulation process, resulting in a
convergence path different from that of the sequential code, but producing
practically the same results.

6  Conclusions 

Power systems composite reliability evaluation using Monte Carlo simulation
demands high computational effort due to the large number of states that need
to be analyzed and to the complexity of these state analyses.

Since the adequacy analyses of the system states can be performed
independently of each other, parallel processing is a powerful tool for
reducing the total computation time.

This paper presented a composite reliability evaluation methodology for a
distributed memory parallel processing environment. A coarse grain parallelism
was adopted, where the processing grain is composed of the adequacy analysis
of several states. In this methodology, the states to be analyzed are generated
directly at the different processors, using a distribution algorithm that is
based on the rank of the processor in the parallel computation.

A parallel implementation was developed in which the Monte Carlo simulation
convergence is controlled in a totally asynchronous way. This asynchronous
implementation has practically ideal load balancing and negligible
synchronization time.

The results obtained on a 4-node IBM RS/6000 SP parallel computer for five
electric systems are very good, with practically linear speedup, close to the
theoretical one, and efficiencies also close to 100%. The reliability indexes
calculated by the sequential and parallel implementations are statistically
equivalent.

A point being explored as a continuation of this work is the migration of the
methodology developed from a distributed memory parallel computer to a network
of workstations. Combining the computation time reduction achieved by this
parallel methodology with the use of networks of workstations already available
at the electric energy companies is of great interest from the economic point
of view.
Abstract.  Two  parallel  codes  have  been  designed  with  the  aim  of  solving  a
fluid-dynamic  problem  that  appears  in  a  real  industrial-scale  glass 
manufacturing  process.  To  solve  this  problem,  the  ADI  method  has  been 
implemented  with  the  two  parallel  programming  paradigms.  Taking  into 
account  the  comparison  of  programming  effort  versus  portability,  the  two 
solutions  offer  different  advantages.  In  this  paper,  we  also  evaluate  the  high 
communication  costs  exhibited  by  the  inherent  parallelism  in  the  ADI  method. 


1.  Introduction 

This paper focuses on an R&D project developed with the aim of obtaining a
parallel code that solves the problems that arise when designing new industrial
glass manufacturing plants. Up to now, it has not been feasible to build a
successful computer-based tool for modelling industrial processes of this type.

The  organisation  of  this  paper  is  as  follows:  this  section  describes  the  modelling  of 
the  physical  process  and  the  mathematical  approach.  Section  2  introduces  the  parallel 
strategies  applied  to  the  parallel  machines  that  have  been  tested  with  our  benchmark. 
Section  3  shows  the  most  relevant  performance  results  and  explains  the  different 
behaviours  of  the  application.  Finally,  Section  4  concludes  with  future  work  that  must 
follow  our  preliminary  results. 


'  This  work  has  been  supported  and  co-funded  by  the  European  Commission  within  the  ESPRIT 
IV,  (Project  Ref.  21037,  HPCN  Initiative  -  PCI  II)  and  the  Spanish  CICYT  under  contracts 
TIC95-0378  and  TIC97-1432-CE. 
1.1.  The  Physical  Process

The  actual  glass  manufacturing  process  can  be  described  as  follows.  In  the 
production  plant,  there  is  a  closed  vat  about  60x8  metres  that  contains  a  thin  layer  of 
liquid  tin.  In  the  head  of  this  vat,  the  glass  paste  is  deposited  at  1200  C  over  the  tin. 
After  a  mechanical  process  of  extension,  sheets  of  different  thickness  are  obtained  at 
the  opposite  end  of  the  vat,  at  600  C.  The  plastic  deformations  induced  in  the  glass  are 
produced  by  towing  the  glass  sheet  along  the  vat.  Because  of  the  glass  displacement, 
tin  currents  are  generated,  provoking  several  mass  transportation  phenomena,  and 
therefore,  different  patterns  of  thermal  energy  are  induced.  These  thermal  conditions 
lead  to  heat  imbalances  that  must  be  modelled  with  the  aim  of  controlling  the  different 
mechanical  and  optical  glass  features  to  obtain  the  optimal  quality.  Up  to  now,  these 
processes  were  controlled  by  human  expertise,  but  for  improving  new  prototype 
designs,  a  computer  based  methodology  ought  to  be  implemented. 

The  manufacturing  modelling  can  be  described  in  terms  of  two  basic  processes. 
First,  a  hydrodynamic  process  is  necessary  to  predict  the  tin  currents  generated  below 
the  glass  sheet.  On  the  other  hand,  a  thermal  process  must  be  modelled  for  forecasting 
thermal  maps  of  both  the  glass  sheet  and  the  underlying  tin  layer.  Both  processes  are 
interrelated  with  each  other. 


Fig.  1.  Scenario  representation:  the  production  vat  and  its  phases,  the  four-layer 
model,  and  a  generic  finite-difference  discretized  element  (m,n). 

In order to achieve accurate results for the analysis and evaluation of new
designs, some requirements are compulsory. The required degree of observability
obliges us to reach a level of detail of around 6 cm to reproduce all
heterogeneous phenomena. This implies a mathematical model based on a
finite-difference scheme, with a discretization mesh of the vat of around
1000x150 elements. As figure 1 shows, this discretization produces a
rectangular square-cell mesh, delimiting the computation domain. Obviously, the
discretization phase takes into account all the elements involved in the
physical process, such as cooling and heating elements, flux forming needs,
tracking wheels to steer the glass sheet, etc. With this accuracy level, a
non-optimised sequential program with the proposed modelling takes more than a
week running on an HP9000/735 workstation to simulate 20 minutes of a real
process, the time needed to reach the steady state of the process. This amount
of computation time makes the tool impractical, and the problem can only be
tackled by means of parallel processing.


1.2.  The  Mathematical  Scheme 

Due  to  the  singularity  of  the  process,  appropriate  numerical  methods  must  be 
chosen.  The  unknowns  of  interest  are  the  vectorial  components  of  the  tin  velocity  (U 
and  V  -  in  a  discretized  notation  in  figure  1),  as  well  as  the  evolution  of  the  tin  level 
with  respect  to  the  free  surface  (identified  by  E).  The  glass  and  tin  temperature 
evolution  are  also  computed.  We  can  describe  the  tin  effect  as  a  transportation  process 
that  can  be  approximated  as  a  long  wave  behaviour,  and  permits  important  reduction 
in  the  modelling  that  represents  such  movement.  This  assumption  is  based  on 
considering  the  vat  length  greater  than  its  depth,  allowing  us  to  assume  that  flux 
evolution  is  mainly  horizontal,  and  therefore  that  a  2D  model  is  sufficient  and  quite 
reliable.  Considering  the  discretization  mesh  in  figure  1,  to  fully  determine  the  long 
wave  equation,  two  initial  expressions  must  be  introduced:  the  momentum  flux 
equation  (1)  and  the  mass  conservation  for  incompressible  fluids  equation  (2),  based 
on  fluid  dynamics  found  in  [1][2].  These  equations  are  expressed  in  the  following 
form: 


\frac{Du}{Dt} = \ldots          (1)

\nabla \cdot u = 0              (2)

where:

u = tin vectorial velocity,
p = pressure,
F = volume forces (gravity),
T = shear stresses in the n-th surface and m-th direction.

The thermal model solves a vertical-layer interaction problem, which consists
of determining the thermal relationship between the layers of a four-layer
model. The vat is composed of four vertical layers: the refractory cooling
block in the floor, the tin layer, the glass sheet and the enclosed atmosphere.
To compute the temporal thermal evolution, the UPWIND method is used [13], with
the typical conduction-advection equations involved in its solution. Moreover,
the hydrodynamic and the thermal processes interact by means of the density
gradient terms of the tin and the glass, which change continuously and are
associated with each discretized element.

In  the  hydrodynamic  process,  the  longitudinal  and  transversal  tin  velocities  - 
U(m,n,t)  and  V(m,n,t)  -and  the  free-surface  tin  level  -  E(m,n,t),  are  computed.  With 
these  unknowns  solved,  the  tin  density  is  affected  by  the  thermal  evolution,  which  also 
affects  its  velocity.  The  temperature  in  each  cell,  T(m,n,z,t),  is  the  next  unknown 
solved  in  the  same  time  iteration.  The  simulation  process  reaches  its  end,  when  the 
transient  state  has  finished. 

Because the model is based on explicit equations and some non-linear elements
can intervene, the time step of the simulation process is chosen to ensure the
stability of the model and its final convergence. This time step is chosen by
applying the Courant and Peclet criteria [1]. These criteria also allow us to
skip the thermal solution in some hydrodynamic time steps, because the thermal
evolution is slower. The Courant and Peclet solution, associated with the mesh
dimension, determines when both processes interact.
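
The text does not give the formulas used, so the following is only a sketch of
how a Courant-limited time step is commonly computed for a long-wave
(shallow-water) model, assuming the condition dt <= C * dx / (|u| + sqrt(g*h)).
The function name, the Courant number, the depth and the velocity are all
hypothetical; only the 6 cm cell size comes from the text.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def courant_time_step(dx, depth, velocity, courant_number=1.0):
    """Largest time step allowed by the long-wave Courant condition.

    dt <= C * dx / (|u| + sqrt(g * h)), where sqrt(g * h) is the
    long-wave celerity for a fluid layer of depth h.
    """
    celerity = math.sqrt(G * depth)
    return courant_number * dx / (abs(velocity) + celerity)

# dx = 6 cm cell size; depth and velocity are hypothetical values.
dt = courant_time_step(dx=0.06, depth=0.05, velocity=0.1)
print(f"dt <= {dt:.4f} s")
```

A smaller cell size or a deeper layer shrinks the admissible step, which is one
reason a fine mesh makes the simulation so expensive.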

The discretization process transforms the differential equations (1) and (2)
into two subsets of partial differential equation systems. One subset is
obtained for the X-dimension and the other for the Y-dimension. For each time
step, U, V and E are computed for the “t+1” instant.


+r, +...=0  w 

Af  Ax 


Af  2Ax  Ap 


Non-linear  terms  appear  and  they  are  modelled  as  a  collection  of  perturbations  Tn, 
that  are  introduced  in  the  system  in  each  point  of  the  mesh,  for  each  instant  “t”.  These 
terms  represent  different  kinds  of  frictions,  the  density  variations  and  other  second- 
order  terms.  Notice  that  solutions  in  “t+1”  consider  instant  “t”  and  “t+1/2”.  After 
expanding  the  discretized  equations,  following  a  bidimensional  decomposition  where 
the  X  dimension  is  divided  in  M  elements,  and  the  Y  dimension  in  N,  the  ADI  method 
is  employed  to  solve  the  hydrodynamic  process  [3][4].  This  scheme  leads  to  solving  M 
tridiagonal  systems  of  linear  equations  of  size  NxN  in  the  ADI  X-sweep  phase,  and  N 
tridiagonal  systems  of  MxM  elements  in  the  orthogonal  ADI-sweep. 
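
The paper does not say how each tridiagonal system is solved. A common choice
is the Thomas algorithm, sketched below as an assumption; the function name and
the argument layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused), diag[i] multiplies x[i],
    upper[i] multiplies x[i+1] (upper[-1] is unused). No pivoting is done, so
    the matrix should be diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                       # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# 3x3 example: [[2, 1, 0], [1, 2, 1], [0, 1, 2]] x = [4, 8, 8]
print(solve_tridiagonal([0, 1, 1], [2, 2, 2], [1, 1, 0], [4, 8, 8]))
```

Each system costs a number of operations proportional to its size, and the
systems of one sweep are independent of each other, which is what makes the
sweeps easy to distribute among processors.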
2.  Parallelization  Approach


2.1.  Initial  Considerations 

The high computation time of the application rules out any computer-based
approach other than a parallelization scheme. To obtain a parallel algorithm
for the ADI method, we propose to divide the M cells among k processors to
solve the tridiagonal systems in the X dimension, perform the communication
between them, divide the N cells among the k processors and repeat the process
for the Y-sweep phase, collect the solutions in a new inter-processor
communication, and start another time step.

Prior to the parallelization task, an optimized sequential code was developed
with the aim of providing a good reference for further comparisons, such as
speed-up, execution time bounds, etc. Production Fortran 77 was the chosen
programming language, and the application was designed for both parallel
programming paradigms, i.e., a shared-memory code as well as a message-passing
one.

To analyse the parallel strategies, we identify the computation and
communication phases. In the computation phase, a domain decomposition is
performed to assign a sub-domain to each processor. In the inter-processor
communication phase, an all-to-all personalised communication pattern [5]
arises as the bottleneck of the algorithm.


2.2.  The  Shared-Memory  Scheme 

The parallel tasks in the code, i.e. the ADI and UPWIND methods, were
programmed via loop partitioning. This SPMD approach is quite easy to apply if
no dependencies arise among sub-domain partitions. In our case, each
dimensional sweep can be done totally free of domain border dependencies when
computing the solution in “t+1”. Processors only need already computed border
data, i.e., data from the previous iteration, and can find these data in global
memory. Therefore, if the domain has been decomposed into MxN elements, the
following steps are performed in each time step:

1.  Considering M cells in the X dimension, each processor computes M/k
tridiagonal systems of size NxN. U(t+1) and E(t+1/2) are solved.

2.  After this first sweep, all processors must await the other processors in a
barrier.

3.  Now, processors perform the Y sweep with N/k tridiagonal systems of size
MxM. V(t+1) and E(t+1) are computed.

4.  When  this  sweep  is  performed,  another  barrier  point  is  reached. 

5.  If  Peclet’s  condition  is  true,  the  thermal  process  must  be  performed: 




5a.  An X-dimension partition is done. Each processor takes N/k portions of the
temperature matrix T to solve the four-layer model in the Z direction,
computing T(m,n,z,t+1).

6.  While  the  steady-state  is  not  reached,  continue  with  point  1. 
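
The partitioning used in steps 1 and 3 can be sketched as follows. This is a
hypothetical helper, not the paper's code: it assigns contiguous blocks of
systems to each processor and spreads any remainder over the first processors.
The mesh size is the one quoted in the text, and 4 processors are assumed.

```python
def block_range(num_items, num_procs, rank):
    """Half-open index range [start, stop) owned by processor `rank` (0-based).

    The first num_items % num_procs processors receive one extra item, so the
    load differs by at most one system between processors.
    """
    base, extra = divmod(num_items, num_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

M, N, k = 1000, 150, 4   # mesh size from the text, 4 processors assumed
for rank in range(k):
    x_sweep = block_range(M, k, rank)   # step 1: about M/k systems of size NxN
    y_sweep = block_range(N, k, rank)   # step 3: about N/k systems of size MxM
    print(rank, x_sweep, y_sweep)
```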

The  shared-memory  paradigm  allows  an  explicit  parallelization  of  the  regions 
computed  with  the  creation  of  dynamic  threads  on  each  processor  of  the  platform, 
using  high-level  directives  via  DOACROSS.  The  following  portion  of  code  is 
representative  to  express  the  X-dimension  partition. 

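C     Iterations of this loop are distributed among the threads.
C*$* ASSERT CONCURRENT CALL
C$DOACROSS LOCAL(I)
      DO 110 I = 1, M
         CALL LINK(I, T, DDX, DDY, IT, EDDY, EDDYZ)
  110 CONTINUE
C     All threads wait here before the next phase starts.
C$PAR BARRIER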

The compiler interprets these directives as concurrent calls to the sproc()
routine [14], spawning parallel threads. The sproc() function is an improved
variant of the fork() routine that allows the creation of light-weight
processes. However, the main drawback is that this code cannot be ported to
other shared-memory platforms. Future work will go in the direction of
developing a portable shared-memory code using POSIX threads.


2.3.  The  Message-Passing  Implementation 

With  the  aim  of  assuring  code  portability  and  for  comparing  other  features,  such  as 
programming  efforts,  communications  costs,  and  several  architecture  evaluations,  we 
designed  the  same  application  using  the  distributed-memory  parallel  programming 
paradigm,  using  message  passing. 

We used the MPI communication library [6]. The programming effort is clearly
higher than for the shared-memory version, but a wide spectrum of architectures
can exploit the message-passing version, from workstation or PC networks,
through real distributed memory machines, to shared-memory architectures
(either SMPs or CC-NUMAs) [7].

The core algorithm is basically the same as that described in section 2.2,
except in points 2 and 4. The need to perform an explicit data transfer via
messages, applying an all-to-all communication pattern, forces the computation
phase to stall while the collective data exchange takes place. Figure 2 depicts
the distribution of one matrix, although it is valid for the four matrices U,
V, E and T. In this figure, let us assume a four-processor machine performing
the X-sweep (figure 2a). The shaded sub-matrix is assigned to processor 2. When
it has finished its computation phase, before performing the Y-sweep, processor
2 must send the blocks that the other processors will need. In the Y-sweep
phase, processor 2 is assigned the shaded portion of the matrix (in figure 2b).
It already owns only one common sub-block, and the others must be received, as
the figure shows. The all-to-all process is repeated when the Y-phase has
finished, to continue with another time step.
Fig.  2.  Matrix  decomposition.  Computation  domains  between  communication  phases.

This inter-processor exchange is made by means of the all-to-all MPI
primitive, and it has been the major programming effort, since from the
performance point of view this whole process penalises the execution time.
Indeed, the basic unit of information to send is a sub-column (Fortran matrix
storage) and, before a message is sent or received, suitable buffers must be
allocated in the destination nodes. Then, each processor calls M/(k-1)
all-to-all primitives, where both receive and send communications are
controlled by the MPI primitive.
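
The data movement between the two sweeps can be illustrated with a small
in-process emulation. No MPI calls are made here: `alltoall` only mimics the
semantics of an all-to-all exchange with Python lists, and both function names
are hypothetical. The sketch assumes the number of columns is divisible by the
number of processes.

```python
def alltoall(send_buffers):
    """Emulate an all-to-all exchange among k processes.

    send_buffers[src][dst] is the block process `src` sends to `dst`;
    the result recv[dst][src] is the block `dst` receives from `src`.
    """
    k = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(k)] for dst in range(k)]

def redistribute(row_blocks, k):
    """Turn a row-block distribution into a column-block distribution.

    row_blocks[p] is the list of matrix rows owned by process p. Each row is
    cut into k column pieces, exchanged, and reassembled so that process p
    ends up with all rows of its column block.
    """
    ncols = len(row_blocks[0][0])
    width = ncols // k                     # assumes ncols is divisible by k
    send = [[[row[d * width:(d + 1) * width] for row in rows] for d in range(k)]
            for rows in row_blocks]
    recv = alltoall(send)
    # Concatenate the pieces received from process 0, 1, ..., k-1 in row order.
    return [[piece for block in recv[p] for piece in block] for p in range(k)]

# 4x4 matrix on 2 processes: rows 0-1 on process 0, rows 2-3 on process 1.
matrix = [[10 * r + c for c in range(4)] for r in range(4)]
row_blocks = [matrix[0:2], matrix[2:4]]
print(redistribute(row_blocks, 2))
```

Each process keeps only the piece addressed to itself and must receive the
other k-1 pieces, which mirrors the single common sub-block described for
processor 2 in figure 2.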


2.4.  Description  of  the  Target  Machines 

A wide variety of machines have been tested with the parallel codes, with a
double purpose. First, the code constitutes a highly communication-demanding
application, so several remarks can be drawn from the behaviour of such
architectures. Additionally, the lack of portability shown by the shared-memory
version is covered by the message-passing code, which has been ported without
problems, although the latter version achieves poorer performance.
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In  Table  1,  there  is  a  summary  of  the  different  machines,  with  their  main  features. 
The  purpose  of  choosing  these  architectures  is  based  on  their  different  interconnection 
networks,  and  it  permits  the  evaluation  of  the  different  performances  of  an  ADI 
problem.  A  deeper  discussion  of  these  machines  is  presented  in  section  3. 


Machine          Processors              Network                                  OS            Compiler
Power Challenge  16 R10000 (195 MHz)     POWERpath2 Bus (1.2 GB/s)                IRIX 6.2      MIPS Pro 7.0
Origin 2000      32 R10000 (195 MHz)     Hypercube                                IRIX 6.4      MIPS Pro 7.2
Origin 200       4 R10000 (180 MHz)      2 linked-crossbar (1.2 GB/s)             IRIX 6.4      MIPS Pro 7.2
IBM SP2          -                       Least Common Ancestor Network (80 MB/s)  AIX V.4       3.2.3
NOW Pentium      4 PentiumPro (200 MHz)  Myrinet (1.28 Gbit/s)                    Linux 2.0.32  egcs-pg77.1.0.2
Quad Pentium     4 PentiumPro (200 MHz)  450GX Chipset Bus (512 MB/s)             [illegible]   g77.0.5.22

Table 1. Main features of the Parallel Machines under study.


3.  Results  and  Discussion 


The first results, presented in figure 3, show how the application scales as
processors are added. The shared-memory code does not improve its performance with
more than 20 processors for a problem of 1000x150. It is important to note that
both the Power Challenge and the Origin 2000 exhibit the same execution time and
the same speed-up. As they have very different networks, this behaviour allows us
to conclude that the communication demands of the application still do not stress
the two machines, and that the loss in performance is mainly due to a significant
reduction in the computation-to-communication ratio per processor. Therefore, with
a more realistic use of the hardware resources for this problem size, a
4-processor Origin 200 and a 12-processor Power Challenge show reasonable
performance results. Compared with the optimised sequential code, which takes more
than 20 hours, the Power Challenge needs 3.5 hours using 8 processors, a speed-up
of 6.3.
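
As a quick consistency check on these figures, the usual definitions of speed-up (sequential time divided by parallel time) and parallel efficiency (speed-up divided by the number of processors) can be applied to the quoted values. The sketch below uses only numbers from the text; the derived sequential time and efficiency are our own arithmetic, not results reported by the authors.

```python
def speedup(t_sequential, t_parallel):
    """Speed-up: how many times faster the parallel run is."""
    return t_sequential / t_parallel


def efficiency(speed_up, processors):
    """Parallel efficiency: fraction of the ideal linear speed-up."""
    return speed_up / processors


T_PARALLEL_HOURS = 3.5   # 8 processors on the Power Challenge (from the text)
PROCESSORS = 8
REPORTED_SPEEDUP = 6.3   # from the text

# A speed-up of 6.3 at 3.5 hours implies a sequential run of about 22 hours,
# which agrees with the stated "more than 20 hours".
implied_sequential_hours = REPORTED_SPEEDUP * T_PARALLEL_HOURS

# Roughly 79% of the ideal speed-up on 8 processors.
parallel_efficiency = efficiency(REPORTED_SPEEDUP, PROCESSORS)
```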

Fig. 3. Speed-up is presented for the shared-memory version, for a discretized grid of
1000x150. The machines are a 32-processor Origin 2000 and a Power Challenge with 16
processors.


Fig.  4.  Elapsed  time  is  presented  for  the  shared-memory  version,  for  a  discretized  grid  of 
1000x150.  A  CC-NUMA  4-processor  Origin  200  and  a  SMP  Power  Challenge  with  12 
processors  are  the  target  architectures. 

The Power Challenge has 12 processors connected by a bus. This architecture
permits only a small degree of scalability, because the bus bandwidth is quickly
saturated as the number of processors increases. The Origin 200 is based on a
quite different architecture. The machine groups its processors in nodes of two.
In each node, a high-bandwidth crossbar interconnects the two processors with
local memory and with the other nodes. Thus, a shared-memory application on a
bus-based system is mainly limited by contention in its memory accesses, whereas
on CC-NUMA architectures it is limited by data distribution. Therefore, as
mentioned before, the main reason why the code shows the same performance on these
different architectures is that neither contention problems nor high-latency data
demands appear for the working size of interest.

In a similar way, performance metrics for the distributed-memory version are
presented in figure 5. This representation allows us to evaluate the effectiveness
of the communication library implementation and its interaction with the
architecture that supports it. Several conclusions arise from this behaviour. The
results can be divided into three subsets.

The performance achieved with these machines shows that communication
requirements dominate the execution time. In all cases, the application does not
seem to benefit from adding more than four processors.

The execution times for the SGI machines show that the Power Challenge
architecture improves on the Origin 200. However, the former quickly saturates its
bus when more than four processors are running the application.


Fig.  5.  Execution  time  achieved  by  the  MPI  version  among  the  different  architectures. 

The second set of results corresponds to the PCs. The first configuration is a
distributed-memory architecture composed of four Pentium Pro processors running
Linux and connected by a high-bandwidth, low-latency Myrinet network [10][11]. The
MPI library is implemented directly on top of a message-passing layer (BIP [12]).
Other MPI implementations have previously been tested over Myrinet, but they
exhibited very poor results because they do not provide specific support for the
Myrinet hardware. Only the BIP library seems to be stable. Also, the compiler
version for this platform generates code optimised for the Pentium Pro. The other
configuration is a Quad Pentium SMP architecture. Its MPI implementation is MPICH,
a socket-based implementation that shows poor results. In addition, the SMP kernel
for this machine is still unstable and does not accept the compiler version with
code optimisation, so the aggressive features of the processor are turned off.

Finally, the IBM SP2 shows the poorest execution time of all the platforms,
although the application scales well enough on it and exhibits lower
communication costs.

In order to quantify the communication costs for this benchmark, figure 6 shows
the percentage of execution time spent in inter-processor data exchange. The small
percentage observed on the IBM SP2 and on the Origin 200 is notable. The basic
reason lies in their physically distributed-memory architecture. The effects of
bus traffic saturation also appear: when more than four processors are used on the
Power Challenge, up to 55% of the total execution time (for 8 processors) is spent
in the communication phases. Thus, only the advanced features of the R10000 allow
this machine to achieve the best execution time among the platforms. Bus
saturation effects also appear on the Quad Pentium using four processors. Its 58%
communication cost is mainly due to the MPICH implementation, which relies on the
operating system to perform the communication primitives.


Fig. 6. Communication costs associated with the different MPI communication library
implementations.

Finally, we present the speed-up behaviour as another important metric.
Considering both versions of the application, the way the application scales is a
very good benchmark for analysing the performance of the interconnection network.
Moreover, the aim of testing this wide spectrum of systems is to understand the
correct architectural design trade-offs.




Figure 7 shows good scalability for the shared-memory version on the Power
Challenge, with a speed-up of up to 6.3 on 8 processors. For this problem size,
however, more than 16 processors do not improve the execution time (figure 3). The
IBM SP2 presents the best speed-up for the distributed version, owing to its low
communication costs. However, it also gives the poorest execution time, because it
uses a previous-generation microprocessor.


Fig.  7.  Speed-up  representation  to  evaluate  application  scalability  bounds. 


4.  Conclusions  and  Future  Work 


To conclude, the industrial end user is now exploiting the shared-memory program
running on the Power Challenge, carrying out two simulations per day. New
prototype manufacturing processes are designed so that they take advantage of the
best design criteria that the simulations offer. Higher product quality and better
market competitiveness are the expected results of this work.

In another line of research, our group is working on the development of a
portable shared-memory code. The effort is focused on obtaining multithreading via
a POSIX threads implementation with shared-memory support. Code portability, with
the objective of usage on a wide spectrum of machines, justifies this programming
effort, as we have seen when developing the distributed-memory code.

The singularity of the application means that the possibilities of using parallel
processing are far from exhausted. Parallel code development nowadays constitutes
a multidisciplinary area where modellers, computer architects and software
designers refine trade-off decisions to achieve better performance, in a field
where the convergence between hardware and software is still not sufficiently
understood.

Abstract. A transputer-based parallel machine is used as a development platform
for fast neural signal processing applications in physics and electricity. The
16-node machine houses 32-bit floating-point digital signal processors running as
coprocessors for the transputers, so that signal processing can be optimized. The
application in physics consists of a prototype of an online validation system for
a high-event-rate collider experiment, which is implemented using neural networks
for physics process identification. In electricity, a nonintrusive load monitoring
system for household appliances is developed using a neural discriminator to
identify seven groups of appliances.

Keywords: parallel processing, neural networks, classifiers, principal component
analysis, real-time systems.

1  Introduction 

Neural networks find a vast area of applications in the signal processing domain [1].
In particular, neural networks have been extensively used as classifiers due to
their ability to combine high classification efficiency and processing speed [2].
As inner products are the main mathematical operations required by neural
processing during the production phase, neural classifiers can be efficiently
implemented in commercial programmable devices, such as digital signal processors
(DSPs). Ultimate limits in processing speed can be reached if the natural
parallelism of neural networks is exploited [3]. Therefore, when both performance
and speed are of concern in a project, neural networks may be considered an
efficient solution.

In this paper we describe two applications of neural processing in a parallel
computing environment. In the first one, an online validation system is designed
for a high-event-rate collider experiment in high-energy physics, which is being
developed at CERN (Switzerland). In this experiment (LHC), bunches of particles
will collide every 25 nanoseconds, so that a large amount of experimental data
will be produced [4]. However, events with physics significance will be extremely
rare. Thus, the incoming data flow needs to be reduced by a highly sophisticated
online validation system that decides whether a given event should be discarded
or stored by the data acquisition system. The second application involves the
design of a nonintrusive electrical load monitoring system for household
appliances. As household appliances account for a significant fraction (~25% in
Brazil) of the total demand in power consumption, knowledge of the consumption
profile of this segment is valuable for energy conservation and for relieving the
electrical system during peak periods.

Both applications described in this paper are developed for the TN-310 system, a
multiple instruction, multiple data parallel computer with a distributed memory
architecture [5]. The system (see Figure 1) houses 16 nodes based on T9000
transputers, which communicate with each other by means of a network of C104
chips [6]. Each node has access to the communication network through four
high-speed (100 Mbits/s) serial links (DS-links). For optimizing signal processing
applications, the system includes fast 32-bit floating-point DSPs (ADSP-21020)
running as coprocessors for the transputers [7]. For this DSP, every instruction
is executed in a single cycle (40 nanoseconds). In terms of memory, each node
comprises 256 kbytes of SRAM used as shared memory, to transfer data to and from
the DSP, and 8 Mbytes of T9000 private DRAM. The DSP can be programmed from the
T9000 through C runtime library calls.


Fig.  1.  The  basic  architecture  of  the  TN-310  system. 


The T9000 is a 32-bit microprocessor that exhibits multiprocessing capabilities.
Communication between processes on different processors takes place over virtual
channels. From the point of view of the applications, these communication
facilities of the T9000 look attractive. Moreover, as the TN-310 includes a fast
DSP on each node, the required speed for running the neural processing can be
achieved in such a transputer-based machine.

The application development makes use of the C toolset environment, in order to
achieve ultimate speed. This software layer features hardware configuration and a
set of libraries and development tools to support ANSI C programming.

Developing an application in this environment comprises two phases. In the first,
the user configures the available hardware according to the application. This
phase describes the parallel application in terms of the number of processors,
the amount of memory for each processor, the interconnection matrix of the
processors, the use of the cache memory and the control signals. In the second
phase, the user develops C code according to the resources made available by the
hardware configuration of the first phase.

In  our  case,  the  TN-310  system  is  accessed  through  a  PC  running  MS-DOS 
and  MS-Windows. 


2  The  Applications 

For both applications to be described, fully-connected feedforward neural
classifiers were designed. These classifiers were trained on preprocessed data by
means of the backpropagation method [2]. The preprocessing phase was introduced in
order to reduce the dimensionality of the data input space, so that more compact
classifiers could be developed.


2.1  Online  Validation  System 

The LHC (Large Hadron Collider) will produce event rates of up to 100 MHz, but
only very rare events will have physics significance for the experiment. In order
to remove the deep background noise that hides potentially interesting events, an
online event validation system is being designed as a multi-level triggering
system [8]. In the first level, a very fast algorithm will reduce the event rate
to 100 kHz. The second-level triggering (LVL2) system will act only on events that
passed the conditions of the previous level. Not all regions of the detectors
contain valuable information for a given event, so only restricted areas of the
detector (known as Regions of Interest, ROI) will be moved by the first-level
system to the LVL2 system. This will alleviate bandwidth requirements on the LVL2
system, which is expected to achieve a further reduction factor of 100 in the
event rate. The surviving events will be analyzed by a third-level trigger, and
only 10 to 100 events per second will be moved to permanent storage.
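
The rates quoted above fix the reduction factor of each trigger level. The short calculation below simply works through that arithmetic; the variable names are ours, and the third-level factors are derived from the two quoted storage rates (10 and 100 events per second) rather than stated in the text.

```python
LVL1_INPUT_HZ = 100_000_000   # up to 100 MHz produced by the collider
LVL1_OUTPUT_HZ = 100_000      # 100 kHz after the first level
LVL2_REDUCTION = 100          # further reduction expected from LVL2
STORAGE_HZ = (10, 100)        # events per second written to permanent storage

# First level: 100 MHz -> 100 kHz is a factor of 1000.
lvl1_reduction = LVL1_INPUT_HZ // LVL1_OUTPUT_HZ

# Second level: 100 kHz / 100 leaves 1 kHz for the third level.
lvl2_output_hz = LVL1_OUTPUT_HZ // LVL2_REDUCTION

# Third level: factor needed to reach each quoted storage rate.
lvl3_reduction = tuple(lvl2_output_hz // rate for rate in STORAGE_HZ)

# Overall reduction from collider rate to storage.
overall_reduction = tuple(LVL1_INPUT_HZ // rate for rate in STORAGE_HZ)
```

Under these figures the whole chain rejects all but one event in roughly a million to ten million.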

The prototype being implemented in the TN-310 system concerns the LVL2 operation.
The architecture used splits the LVL2 operation into two phases [9]. In the first,
raw detector data is translated into features that can efficiently identify the
relevant physics processes. This feature extraction works on ROI information
provided by the detectors that participate in the LVL2 decision: calorimeters
(for energy measurement), trackers (for image display of inner interactions) and
muon chambers (for muon detection). Next, the global decision phase correlates
detector information for analysis refinement. Features are combined to compute the
probability that a particle is found in a given ROI, so that physics processes can
be identified.

Both phases may be performed by neural processing [10, 11]. For the calorimeter
subsystem, feature extraction was performed by grouping cells of deposited energy
in a ROI, and feeding such grouped cells into the input nodes of a neural network
that performs electron (signal)/jets (background noise) separation.

Figure 2 shows how cells are grouped together. Thicker lines define the border of
each region, and cells belonging to a region have their energies added up to form
group sums. As the outermost regions play an important role in the discrimination
process but have their energy values masked by the substantially higher energy
level of the center region, optimum weighting factors for these regions were
determined by integrating the search for the optimum weighting profile into the
training phase of the network [12].

This grouping scheme combines performance and compactness efficiently, as it
achieves high discrimination levels and also substantially reduces the number of
input nodes of the network (now ten, instead of the original 121 components of
the ROI). For the implementation of the neural feature extractor, the neural
network comprised 3 hidden nodes and a single output neuron, so that electron/jet
(of particles) discrimination could be efficiently performed (only 7.3% of jets
were misclassified as electrons for a 97% electron efficiency). For the other
three detectors involved in the LVL2 decision scheme, a simulation of current
classical algorithms was implemented [8].


Fig.  2.  The  grouped  sum  structure. 


For the global decision phase, the capability of neural networks to correlate
information in a multidimensional feature space was explored. A set of twelve
features from the detectors that participate in the LVL2 decision was fed into the
neural classifier, which comprised six neurons in the hidden layer and four output
nodes, so that electrons, pions, jets and muons could be detected (see Figure 3).
The normalization of the feature vector was performed by means of fixed factors,
which were obtained by computing the mean value of the data distribution
(restricted to the training set) for each feature and allowing an additional 2σ
tail. Maximum probability defined the winner class for a given pattern fed into
the input nodes of the classifier.
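
The structure of this classifier (fixed normalization factors, six hidden neurons, four outputs and a winner-takes-all decision) can be sketched as follows. This is an illustration only: the weights are random placeholders rather than trained values, the normalization factor is read here as the mean plus twice the (population) standard deviation of each feature, and the hyperbolic tangent is assumed on both layers. None of these names come from the original implementation.

```python
import math
import random
import statistics

CLASSES = ["electron", "pion", "jet", "muon"]


def normalization_factors(training_set):
    """One fixed factor per feature: mean + 2 sigma over the training set."""
    columns = zip(*training_set)
    return [statistics.mean(c) + 2 * statistics.pstdev(c) for c in columns]


def layer(inputs, weights, biases):
    """Fully connected layer: inner products followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def classify(features, factors, w_hidden, b_hidden, w_out, b_out):
    """Normalize, run the 12-6-4 network, and pick the largest output."""
    normalized = [f / k for f, k in zip(features, factors)]
    hidden = layer(normalized, w_hidden, b_hidden)
    outputs = layer(hidden, w_out, b_out)
    winner = max(range(len(outputs)), key=outputs.__getitem__)
    return CLASSES[winner], outputs


def random_weights(rows, cols, rng):
    """Placeholder weights; a real classifier would load trained values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
```

Each call costs 12x6 + 6x4 = 96 multiply-accumulate operations plus ten activation evaluations, which is why inner products dominate the processing time.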

Simulated  data  sets  for  the  second-level  trigger  operation  at  LHC  conditions 
were  used  for  training  the  networks  [13, 14]. 


Fig. 3. The global decision network.


2.2  Electrical  Load  Monitoring  System 

Both transient and steady-state information were used to characterize data
acquired from seven groups of household appliances: refrigerating, resistive
heating, universal motor, ventilating, consumer electronics, incandescent lamp and
fluorescent lamp.

Starting  from  the  time  each  appliance  had  been  switched  on,  current  signals 
were  sampled  (500  samples  per  second)  from  the  AC  line  by  means  of  a  digital 
storage  oscilloscope  for  a  2  second  acquisition  window.  Steady-state  response 
provided  current  and  phase  angle  information  and  measurements  were  made 
after  a  minimum  of  2  seconds  operation  [15]. 

The envelope of the transient signal is defined by the 60 Hz fundamental frequency
of the AC line, and features of its shape can be extracted by retaining the
current peaks. From the original 1024 samples, 200 peaks were retained. A
principal discriminating analysis was performed on these samples. This analysis
aimed at finding the most discriminating components in the data input space, so
that classification can be achieved with a minimal number of hidden nodes in the
network [16].
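
The text does not give the exact rule used to retain the current peaks. One plausible reading, shown here purely as an assumption, is to keep the local maxima of the rectified (absolute-value) signal, which yields one point per half-cycle of the AC waveform. The function name is illustrative.

```python
def retain_peaks(samples):
    """Return (index, magnitude) for each local maximum of |samples|.

    A sample counts as a peak when its magnitude exceeds that of the
    previous sample and is not smaller than that of the next one.
    """
    peaks = []
    for i in range(1, len(samples) - 1):
        magnitude = abs(samples[i])
        if magnitude > abs(samples[i - 1]) and magnitude >= abs(samples[i + 1]):
            peaks.append((i, magnitude))
    return peaks
```

The sequence of retained magnitudes traces the envelope of the transient and is much shorter than the raw record, which is the point of this preprocessing step.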

Fig. 4. Basic topology for the principal discriminating analysis.


Figure 4 shows the network topology that extracts the principal discriminating
components and simultaneously identifies the classes of household appliances.
Starting from a single neuron in the hidden layer (arrangement shown at the top of
Figure 4), the network is trained to obtain maximum discrimination efficiency.
Next, a second neuron is added to the hidden layer of the network. The resulting
network is trained in such a way that the weight vector that connects the input
nodes to the first neuron in the hidden layer is kept fixed during the training
procedure, as it represents the first component already extracted in the previous
phase. All the other synaptic weights in the network are changed according to the
backpropagation method, including the weight vector that connects the first neuron
of the hidden layer to the output layer. This allows the network to combine the
discriminating components in the best way, as a new component becomes available
in this phase. The training procedure continues in this way, by adding a new
neuron to the hidden layer, freezing the components previously extracted and
allowing the rest of the synaptic weights to be trained, until the addition of new
components does not improve the overall discrimination efficiency.
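
The control flow of this constructive procedure can be summarized in a few lines. The sketch below captures only the outer loop and its stopping rule; the actual backpropagation step is abstracted into a caller-supplied function, here called `train_step`, which is a hypothetical name. It receives the list of frozen components and returns the newly extracted component together with the discrimination efficiency reached.

```python
def grow_hidden_layer(train_step, min_gain=0.0, max_neurons=10):
    """Add hidden neurons one at a time until efficiency stops improving.

    train_step(frozen) must train a network with len(frozen) + 1 hidden
    neurons, keeping the input weights in `frozen` fixed, and return
    (new_component, efficiency).
    """
    frozen = []   # input-to-hidden weight vectors already extracted
    best = None   # best discrimination efficiency so far
    while len(frozen) < max_neurons:
        component, eff = train_step(frozen)
        if best is not None and eff - best <= min_gain:
            break  # the extra neuron brought no improvement: discard it
        frozen.append(component)
        best = eff
    return frozen, best
```

The `min_gain` and `max_neurons` parameters are safeguards added for the sketch; the text only states that training stops when new components no longer improve the efficiency.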

For this application, only two neurons in the hidden layer were needed. Each
output node was assigned to a group of appliances, and maximum probability was
used to detect the winner class for a given input.

3  Implementations 

The system implementation for both applications made use of a master node to
perform communication with the outside world and to supervise the continuous
distribution of data to the slave processors. The slave processors act as feature
extractors or global decision units (gdus) for the physics application (Figure 5),
and perform preprocessing and neural classification for the load monitoring
system. For the validation system, a local network (split into two processing
nodes) is used to label and group data from the feature extraction phase and to
transmit features to the global decision layer.


Fig.  5.  Simplified  scheme  for  the  implementation  of  the  validation  system. 


The data parallelism approach that is being used is intended to minimize the
communication overheads of the TN-310 system. As previously mentioned, the system
uses a fast switching network (based on the C104) for fast packet distribution
among processors. When data are to be transmitted, the sender passes a message to
its local virtual channel processor (VCP) on the T9000, indicating the receiving
address and the size of the data to be transmitted. The VCP splits the data into a
number of 32-byte packets, and each packet carries a header that indicates the
receiving address and the routing needed. Then, the packet is sent to the first
switch in the predefined route, which interprets (and then removes) the first
subheader and sends the packet to the next node in the route. At the final
destination, each packet is acknowledged to allow the transmission of a new
packet. Figure 6 illustrates this scheme.
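
The packet mechanism described above can be mimicked with a toy model. It is a sketch under simplifying assumptions, not the T9000/C104 protocol: the header is represented as a plain list of hop identifiers, acknowledgement is implicit in the sequential loop, and all names are illustrative.

```python
PACKET_SIZE = 32  # bytes of payload per packet, as stated in the text


def split_into_packets(data, route):
    """Cut `data` into 32-byte packets, each carrying the full route header."""
    return [{"header": list(route), "payload": data[i:i + PACKET_SIZE]}
            for i in range(0, len(data), PACKET_SIZE)]


def forward(packet):
    """A switch reads the first subheader, strips it, and passes the rest on."""
    next_hop = packet["header"][0]
    stripped = {"header": packet["header"][1:], "payload": packet["payload"]}
    return next_hop, stripped


def deliver(data, route):
    """Send packets one at a time; the next starts once the last has arrived."""
    received = bytearray()
    for packet in split_into_packets(data, route):
        while packet["header"]:          # hop through every switch in the route
            _, packet = forward(packet)
        received += packet["payload"]    # arrival stands in for the acknowledge
    return bytes(received)
```

For example, a 100-byte message becomes four packets (three full ones and a 4-byte remainder), each of which pays the per-packet routing overhead.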

The minimum time required for packet (32 bytes) transmission was measured to be
~7 µs. Therefore, as the neural networks for both applications are relatively
compact (and fast: feature extraction for the calorimeter is achieved in 10 µs),
data communication time can be considered quite significant relative to the
overall processing speed. Consequently, data parallelism will minimize
dependencies among nodes, and the speedup of the applications will be maximized.

Fig. 6. Routing packets in the TN-310 system.

As communication time represents the bottleneck of applications in this
environment, it may be useless to develop the application using all the resources
of the machine. For instance, by the time data are distributed to, say, slave #M
in the parallel processing chain, the first slave may have finished its processing
and become free. Thus, further nodes added to the chain will not improve
processing speed. This fact was exploited to develop the applications using
minimal machine resources. On the other hand, as slaves become free, a
two-distributor scheme may be tried to compensate for communication overheads.
Using twin data-parallel structures, an improvement in processing speed and
parallel efficiency may be achieved.
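
A back-of-the-envelope model makes the saturation argument concrete. It is our own simplification, not an analysis from the paper: a single distributor sends data to one slave at a time, each transfer takes `t_transfer`, each slave then computes for `t_process`, and a slave can receive new data only once it is idle. Under these assumptions, the first slave is free again by the time the distributor has served ceil(1 + t_process / t_transfer) slaves, so adding more brings no gain.

```python
import math


def useful_slaves(t_transfer, t_process):
    """Largest number of slaves one distributor can keep busy.

    Times may be in any unit as long as both use the same one.
    """
    return math.ceil(1 + t_process / t_transfer)
```

With a transfer time of about 100 µs and a processing time of 10 µs, this simple model gives two slaves per distributor. When transfer and processing take about the same time it also gives two, so a second distributor is needed to put further slaves to work.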

The activation function (hyperbolic tangent) of the neural networks was
implemented by means of a look-up table, in order to achieve shorter computation
times. The fixed-step table makes possible a direct correspondence between the
address of a point in the table and its abscissa. Thus, the look-up is fast and
takes constant time. Moreover, the table size is reduced, as abscissa values are
implied by the memory addresses and only ordinates are stored.


Fig. 7. The way the DSP can be accessed from the T9000.


The sampling resolution for building this table, chosen so that it fully
reproduced the simulated network operation with minimum memory requirements, was
found to be 0.01. Saturation of the activation function was considered to be
reached at 7 for both positive and negative arguments. This approximation of the
behavior of the gain function also did not deteriorate the performance of the
network, and it considerably reduced the memory requirements.
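
A fixed-step table of this kind might look as follows. The step (0.01) and the saturation point (7) are taken from the text; the nearest-entry rounding, the full coverage of the range [-7, 7] without exploiting the odd symmetry of tanh, and all names are assumptions of this sketch.

```python
import math

STEP = 0.01         # sampling resolution of the table
SATURATION = 7.0    # beyond +/-7 the function is treated as saturated

# Only ordinates are stored; the abscissa is implied by the index.
TABLE_SIZE = int(round(2 * SATURATION / STEP)) + 1
TABLE = [math.tanh(-SATURATION + i * STEP) for i in range(TABLE_SIZE)]


def tanh_lut(x):
    """Constant-time tanh approximation using the fixed-step table."""
    if x >= SATURATION:
        return 1.0
    if x <= -SATURATION:
        return -1.0
    index = int(round((x + SATURATION) / STEP))  # address from abscissa
    return TABLE[index]
```

With these settings the table holds 1401 entries. Since the slope of tanh never exceeds 1, picking the nearest entry keeps the error below about half a step (0.005).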

In a multiple-processor application, a user process placed on one processor may
access the DSP on its own node and no other. The DSP library provided with the
TN-310 system is written in C and offers signal processing and mathematics
functions. The way the user process accesses its accompanying DSP is shown in
Figure 7. The process has access to the two banks of shared memory and controls
the access protocol to the DSP. An alternating scheme is an efficient way to use
the buffers: the user process loads some commands, or reads the previous results,
in one buffer, while the DSP computes the previous commands of the other buffer.
This is supported by a DSPbufSwap procedure that returns a buffer number. When the
procedure returns a buffer number, the user process can access this specific
buffer to read its contents or to write commands into it. Then, the process
releases the buffer by calling DSPbufSwap again, and the DSP starts to compute the
given command. The DSPs were used to compute the inner products required by the
neural processing and the required preprocessing.
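
The alternating use of the two buffers can be illustrated with a toy model. It is not the TN-310 library: the class, its `swap` method and the `run` helper are hypothetical stand-ins loosely modelled on the description of DSPbufSwap, and the "DSP" work is done synchronously inside `swap`, whereas the real device computes concurrently.

```python
def inner_product(u, v):
    """The operation the DSP performs for the neural processing."""
    return sum(a * b for a, b in zip(u, v))


class PingPongBuffers:
    """Two shared buffers; the user owns one while the 'DSP' owns the other."""

    def __init__(self):
        self.command = [None, None]
        self.result = [None, None]
        self.user_side = 0

    def swap(self):
        """Release the user's buffer to the DSP and return the other number."""
        released = self.user_side
        if self.command[released] is not None:
            u, v = self.command[released]
            self.result[released] = inner_product(u, v)  # DSP work (simulated)
            self.command[released] = None
        self.user_side = 1 - released
        return self.user_side


def run(pairs):
    """Feed vector pairs through the two buffers, collecting results in order."""
    dsp = PingPongBuffers()
    results = []
    buf = dsp.user_side

    def collect(b):
        if dsp.result[b] is not None:   # read the previous result, if any
            results.append(dsp.result[b])
            dsp.result[b] = None

    for pair in pairs:
        collect(buf)
        dsp.command[buf] = pair          # load the next command
        buf = dsp.swap()
    for _ in range(2):                   # drain both buffers at the end
        collect(buf)
        buf = dsp.swap()
    return results
```

The benefit on the real machine comes from overlap: while the user process fills one buffer, the DSP is already working on the other.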


Fig. 8. The validation system implementation using minimal machine resources.


The prototype of the validation system for physics was implemented using 13 nodes
(see Figure 8) and was able to cope with a 2.6 kHz input frequency (speedup factor
of 4.4). The nodes labeled trt and sct refer to subdetectors of the tracking
system and calo to the calorimetry system; a further node serves the muon
chambers. For each processing layer only two nodes were needed, as data
communication between the supervisor (master) node and the slaves required
significant time compared with the neural processing time (as mentioned above).
For instance, the calorimeter data required more than 100 µs of transmission time,
but the neural feature extraction for this detector is performed in 10 µs.

The  resulting  system  can  be  considered  to  emulate  a  vertical  slice  of  the 
actual  second-level  triggering  system,  which  will  be  running  in  practice  close  to 
this  speed.  Particle  identification  above  94%  was  achieved  by  the  system. 

For the load monitoring system, principal component analysis and classification
ran in less than 100 µs (speedup factor of 8.6). Here, as the data input vectors
comprised 200 components, data transmission required about as much time as data
processing. Therefore, only 6 processing nodes were needed to implement the
system. This consideration allowed the use of a two-distributor configuration for
this application, as mentioned above. Figure 9 shows this configuration, where a
master node interfaces with the outside world and transmits data to its slaves and
to the other master node. Over 100 different pieces of equipment studied, the
system correctly classified more than 84% of the sample.


Fig.  9.  Basic  implementation  scheme  of  the  load  monitoring  system. 


4  Conclusions 

A transputer-based parallel machine (TN-310 system) was used as a development
platform for neural processing applications. The applications explored the
main features of this machine, which combines the ability of the T9000
transputer to manage communicating processes running in parallel with the
signal processing speed of the ADSP-21020 digital signal processor. Although
significant communication overheads were observed, the technology proved
flexible enough to accommodate different sorts of applications.

The applications concerned a prototype of an online validation system for a
high event rate collider experiment in particle physics and a nonintrusive
electrical load monitoring system for household appliances. The design
procedure for both applications involved preprocessing methods (a topological
mapping or principal component analysis) and the optimization of machine
resources according to application requirements. Neural processing was
implemented using a lookup table for the activation function and runtime
library calls to the DSPs.

In order to compare technologies, the applications are being moved to a faster
DSP (ADSP-21060, 25 ns instruction cycle time) that has integrated
multiprocessing features. This DSP-based solution moves toward more
cost-effective commercial computing platforms, as PC-compatible commercial
boards are readily available in this technology.


Abstract. Computer networks are now common everywhere, connecting from a few
to hundreds of personal computers or workstations. The idea of getting the
most out of the installed computing power is not new, and several studies
have shown that most computers are idle for long periods of the day. The work
reported here aims to find, for a given algorithm, the number of processors
that should be used in order to obtain the minimum processing time. To support
the parallel execution, the WPVM software was used on a Windows NT network of
personal computers.


1  Introduction 

A parallel computer is composed of a set of processors connected by a network
according to one of several possible topologies (mesh, hypercube, crossbar,
central bus, etc. [8]). If the processors are connected so as to maximize their
communication performance and operate together exclusively on the solution of
one problem, the machine is called a Supercomputer. If the processors are of the
general purpose type, each one being a workstation connected by a general
purpose network (e.g. Ethernet), then when they operate together on the solution
of a given problem, the set is called a Parallel Virtual Computer.

There are significant differences between a Supercomputer and a Parallel
Virtual Computer, notably in the interconnection network. The general purpose
network allows only one pair of processors to communicate at a time, and it may
also be shared by other computers not belonging to the Parallel Virtual
Computer, resulting in low communication rates. Fine grain parallelization,
common in Supercomputers, becomes impractical in Parallel Virtual Computers,
where medium or coarse grain parallelization, at the program or procedure
level, is used instead.

The aim in using a Parallel Virtual Computer, as for a Supercomputer, is to
reduce the processing time of a given program. This is achieved by using
execution cycles of several computers that would otherwise go unused. The most
conclusive measure of parallelization performance is the reduction obtained in
the processing time or, equivalently, the Speedup obtained.

This report presents a method to calculate the number of processors that
should participate in the execution of a given algorithm according to its
characteristics.


2  Interconnection  Networks 

A parallel computer is built of processing elements and memories, called nodes,
and an interconnection network, composed of switches, that routes messages
between those nodes. The interconnection network topology is the pattern by
which the switches are connected to each other and to the nodes. The network
topology can be classified as direct or indirect. Direct topologies connect each
switch directly to a node, resulting in a static network that does not change
during program execution. Examples of static networks are the mesh, hypercube
and ring.

Indirect topologies connect at least some of the switches to other switches,
resulting in dynamic networks that can be configured to match the communication
demand of the user program. Examples of dynamic networks are the multistage
interconnection network, the fat-tree, crossbar switches and buses [7].

The  interconnection  network  of  a  parallel  virtual  computer  is  similar  to  a  bus 
as  shown  in  Figure  1. 


Fig.  1.  Interconnection  Network  (Ethernet)  of  a  Virtual  Parallel  Computer 


The logical topology of an Ethernet provides a single channel, or bus, that
carries Ethernet signals to all stations, allowing broadcast communication. The
physical topology may include bus cables or a star cable layout; however, no
matter how the computers are connected together, there is only one signal
channel delivering packets over those cables to all stations on a given
Ethernet LAN [10].

Each message is divided into packets of 46 to 1500 bytes of data, which are sent
sequentially and individually onto the shared channel. For each packet the
computer has to gain access to the channel. This division of a message into
packets leads to a latency time for each message that is proportional to the
number of packets into which it is split:

$$T_C = k\,T_L + \frac{n_{bytes}}{w} \qquad (1)$$

T_C being the total communication time for the given message, k the number of
packets into which the message is split, T_L the latency time for one packet,
n_bytes the message length in bytes and w the network bandwidth. For a
particular system, equation 1 is used to estimate the value of T_L. For the
network used in the Virtual Computers M1 and M2, referred to in the results
section, the mean value of T_L was estimated as 500 microseconds and the packet
size as 1 Kbyte.
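
Equation (1) can be evaluated directly. The following is a minimal sketch, not
the authors' code: it assumes that k is the message length divided by the packet
size, rounded up, and it takes the 100 Mbit/s Fast Ethernet bandwidth quoted in
the results section (12.5e6 bytes/s). The function and parameter names are
illustrative.

```python
import math


def comm_time(nbytes, latency_s=500e-6, packet_bytes=1024, bandwidth_Bps=12.5e6):
    """Communication time Tc = k * TL + nbytes / w (equation 1), in seconds."""
    # Assumption: k is the number of packets needed to carry the message.
    k = math.ceil(nbytes / packet_bytes)
    return k * latency_s + nbytes / bandwidth_Bps


if __name__ == "__main__":
    for size in (1024, 64 * 1024):
        print(f"{size:6d} bytes -> {comm_time(size) * 1e3:.3f} ms")
```

With these values the per-packet latency dominates: a single 1 Kbyte packet
spends 500 microseconds waiting and only about 82 microseconds on the wire.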


3  Performance  Measures 


A  Parallel  Computer  can  be  evaluated  according  to  several  characteristics,  such 
as  the  processing  capacity  (Mflop/s),  the  network  bandwidth  (Mbytes/s),  the 
processing  capacity  of  each  processor  individually,  the  memory  access  method 
and  time,  etc.  However,  its  performance  is  always  referred  to  a  given  algorithm. 

The ratio between the serial processing time and the parallel processing time
is referred to as Speedup and reflects the gain obtained with the
parallelization:

$$\mathrm{Speedup} = \frac{T_{Serial}}{T_{Parallel}} \qquad (2)$$


3.1  Speedup  Limits 

For a given problem of a given dimension, there is a finite quantity of work
to be done in order to obtain its solution. Therefore, there is a maximum
number of processors that can be used, above which there is no work to schedule
for additional processors. Thus, the number of processors to be used and the
maximum Speedup achievable for a given problem are limited by the quantity of
work to be done. As an example, consider the addition of two vectors of
dimension n, which involves n additions. If one uses more than n processors,
the extra processors will not have any work to do, resulting in a maximum
relative Speedup of n.

Amdahl [5] defined a rule showing that the Speedup is limited by the inherently
sequential part of the program: let s represent the sequential,
non-parallelizable part of the program and p the part that can be parallelized,
which can execute with Speedup P on a computer with P processors. The observed
Speedup will then be

$$\mathrm{Speedup} = \frac{s + p}{s + \dfrac{p}{P}} \qquad (3)$$

s + p being the processing time of the sequential program and P the number of
processors used. As an example, if the serial program runs in 93 s, of which
90 s can be parallelized, then p = 90 s / 93 s = 0.9677, or 96.77%, and s, the
inherently sequential part, takes the value 0.0323, which is 3.23% of the
code [6]. The sequential part is composed mainly of input/output operations.

From the Speedup definition one can obtain its limit: as P approaches infinity,
the Speedup approaches 1/s (with s + p normalized to 1). For the example
presented above, the maximum Speedup is 1/0.032 = 31.25 whatever the number of
processors used, so no number of processors can yield a Speedup above about 31.
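
The behavior of equation (3) is easy to tabulate. The sketch below uses the
93 s example from the text; the function name is illustrative. Note that with
the unrounded value s = 3/93 the limit is exactly 31, while the 31.25 quoted
above comes from rounding s to 0.032.

```python
def amdahl_speedup(s, p, P):
    """Speedup = (s + p) / (s + p / P), as in equation (3)."""
    return (s + p) / (s + p / P)


if __name__ == "__main__":
    s, p = 3.0, 90.0  # seconds: sequential and parallelizable parts of the 93 s run
    for P in (1, 2, 8, 31, 128, 10**6):
        print(P, round(amdahl_speedup(s, p, P), 2))
    print("limit (s + p) / s =", (s + p) / s)
```

Even with 31 processors the Speedup stays below 16, which illustrates how slowly
the limit is approached.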


Fig. 2. Comparison between Theoretical (left) and Observed (right) Speedup


Figure 2 (left) shows the Theoretical Speedup for several values of s, which is
assumed constant for a given algorithm, whatever the value of P. Amdahl's law
contributes two important insights into parallelism. First, it gives a more
realistic expectation of the results that can be achieved and, second, it shows
that for the Speedup to reach high values, one must reduce or eliminate the
sequential parts of a given algorithm [6].

Amdahl took s to be constant; however, in most algorithms, an increase in the
number of processors leads to an increase in the communication overheads. If
those are considered as bottlenecks and added to s, the Speedup behaves as
shown in Figure 2 (right). The observed Speedup increases until a given value of
P is reached, after which it decreases. In conclusion, the ideal number of
processors to be used for a given algorithm will be below the number obtained
from the limit of Amdahl's law.


3.2  The  ideal  number  of  processors 

From the Speedup expression given above, its value increases as the execution
time of the parallel program decreases. Assume that the parallel time is given
by T_p(n,P) = s(n,P) + p(n,P). As shown in Figure 3 for a generic algorithm,
the processing time is composed of an initial operation for data distribution,
followed by the time for parallel processing, including messages, and ending
with an operation for the collection of results by the master process. To get
the highest Speedup, T_p has to be as low as possible, and the optimal value of
P is the one at which, for a given algorithm, the increase in the serial
component due to the addition of one more processor balances the gain obtained
in the processing time of the parallel component.


Fig. 3. Timing diagram of a parallel algorithm. The total time Tp comprises
data distribution and parallelism management (S1), parallel processing (P1),
and results received by the master process (S2).


The function representing the sequential component of the algorithm, s(n,P),
depends on the problem dimension, n, and on the number of processors used,
P. It represents the input/output operations, the sequential code required to
manage the parallelism and the communication overheads. Assuming a homogeneous
computer network, where each processor has an individual processing capacity of
S Mflop/s, a network bandwidth of W Mbytes/s, and n input data elements to be
distributed, one gets the following expression:

$$s(n,P) = \frac{C_1 P}{S} + \frac{n}{P}\left(\frac{1}{W} + T_E\right)P + k_1 T_L + \left(\frac{1}{W} + T_E\right) b\,(P-1) + k_2 T_L (P-1) + \frac{h}{P}\left(\frac{1}{W} + T_E\right)P + k_3 T_L \qquad (4)$$

In the expression above it was assumed that each process communicates only
with its neighbor processes, which is certainly not true for all algorithms. In
practice the communication components have to be modeled for every algorithm.
The first term represents the time spent on parallelism management, where C_1
is a constant dependent on the number of instructions used. The second term
represents the time required to distribute the n elements of the input data
among P processors, where T_E is the packing time per byte. The third term
represents the latency time of the P initial messages required to distribute
the input data, where k_1 is the number of packets required to send the data.
The fourth term represents the time spent by each processor in communications
with the next processor for the parallel algorithm (transmission and packing,
for sending and receiving), b being the number of bytes exchanged. The fifth
term represents the latency time of those messages per processor, k_2 being the
number of packets required. The sixth term represents the time required by the
master process to receive the results from all processes, h being the number of
bytes to receive. The last term is the latency time for the results messages,
where k_3 is the number of packets required.

The function representing the algorithm's parallel component, p(n,P), also
depends on the problem dimension and on the number of processors used. It can
be calculated as:

$$p(n,P) = \frac{n\,\beta^{\alpha}}{P\,S} \qquad (5)$$

β being the number of instructions computed α times for each element of n.
Adding both expressions, s + p, one obtains the total parallel processing time:

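$$T_p(n,P) = \frac{C_1 P}{S} + \frac{n}{P}\left(\frac{1}{W}+T_E\right)P + k_1 T_L + \left(\frac{1}{W}+T_E\right)b\,(P-1) + k_2 T_L (P-1) + \frac{h}{P}\left(\frac{1}{W}+T_E\right)P + k_3 T_L + \frac{n\beta^{\alpha}}{PS} \qquad (6)$$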

For a given problem of dimension n, and assuming the constants α and β are
known, the minimum value of T_p(n,P) is given by ∂T_p/∂P = 0:

$$\frac{C_1}{S} + \left(\frac{1}{W}+T_E\right)b + k_2 T_L - \frac{n\beta^{\alpha}}{P^2 S} = 0 \qquad (7)$$

which gives for P the value

$$P = \sqrt{\frac{n\beta^{\alpha}/S}{C_1/S + k_2 T_L + (1/W + T_E)\,b}} \qquad (8)$$


This expression shows that P is the square root of the ratio of the useful
processing time to the time spent in communications and parallelism
management. The value of P is then used to compute the expected processing time
of the parallel program, T_p, which is compared with the estimated serial
processing time, T_s. If T_p > T_s, the serial version of the algorithm is
used.


3.3  Application  to  the  Parallel  Virtual  Computer 

The nodes of the Parallel Virtual Computer are processors with different
characteristics, mainly with respect to processing capacity and available
memory, forming a heterogeneous system. Therefore, the value of S, taken as
constant in the computation of P, needs to be replaced by a value that
represents the Heterogeneous Parallel Virtual Computer. Thus, S can be replaced
by a weighted mean such as:

$$\bar{S} = \sum_{i=1}^{M} S_i W_i \Big/ \sum_{i=1}^{M} W_i \qquad (9)$$

W_i being the weight given to processor i, defined as the ratio between the
processing capacity of processor i and the capacity of the fastest processor in
the Virtual Computer: W_i = S_i / S_max. With this weight the fastest processors
have more influence, leading to lower values of P and probably to the use of
only the fastest processors. The computational load (l_i) is distributed in
proportion to the relative processing capacity of processor i:

$$l_i = S_i \Big/ \sum_{k=1}^{P} S_k \qquad (10)$$
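
Equations (9) and (10) can be checked numerically. The sketch below is an
illustration with hypothetical function names; it uses the processor capacities
listed for the virtual computer M2 in the results section, for which the
weighted mean reproduces the quoted equivalent capacity of 129.7 Mflops.

```python
def equivalent_capacity(capacities):
    """Weighted mean of equation (9), with weights W_i = S_i / S_max."""
    s_max = max(capacities)
    weights = [s / s_max for s in capacities]
    return sum(s * w for s, w in zip(capacities, weights)) / sum(weights)


def load_shares(capacities):
    """Fraction of the load given to each processor, equation (10)."""
    total = sum(capacities)
    return [s / total for s in capacities]


if __name__ == "__main__":
    m2 = [161, 161, 105, 91, 80]  # Mflops, as listed for M2
    print(round(equivalent_capacity(m2), 1))
    print([round(share, 3) for share in load_shares(m2)])
```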


4  Implementation 

The parallel implementation of the image processing algorithms referred to in
the results section was done with the WPVM software, an implementation of PVM
for the MS Windows operating system developed at the University of Coimbra,
Portugal [1,2]. The software can be downloaded from http://dsg.dei.uc.pt/wpvm.
WPVM offers the same set of functions as standard PVM [4] and allows
interaction between WPVM and PVM hosts.

5  Results 

To validate the presented methodology, two image processing algorithms were
implemented: a step edge detection algorithm [9] and an algorithm for histogram
computation [3]. Two parallel virtual computers, M1 and M2, were used, composed
of processors with the following capacities, in Mflops: M1 = {80, 80, 80, 80,
45, 45, 40, 40, 35, 35, 35} and M2 = {161, 161, 105, 91, 80}, with equivalent
processing capacities, S1 and S2, of 68.8 and 129.7 Mflops, respectively.

For both algorithms the computational load is evenly distributed over the
domain space, since the same operation is carried out for every input data
element (an image pixel).


5.1  Step  edge  detection  algorithm 

Edge detection is an important subject in image processing because the edges
generally correspond to objects that one wants to segment. The edge detector
operator described in [9] is an optimal linear operator with an infinite window
size. The operator is an infinite impulse response filter realized by a
recursive algorithm. To find the filter response for each pixel, the components
along x and y are first computed. Each of them requires four basic operations
to be executed, as shown in Figure 4.

Fig. 4. Edge detection algorithm

The operations are independent of each other; however, in each basic operation,
the result for the previous pixel has to be known. This dependency is important
for data distribution, which should minimize the number of accesses to
non-local data. For this algorithm, a square blocked distribution (see
Figure 5) requires more messages than the column or row blocked strategies. The
data distribution selected was row blocked, because the image is stored row by
row in computer memory, so only one packing instruction is required to pack
several contiguous rows. Row and column blocked distributions require the same
amount of data to be transferred among processors.


Fig. 5. Data distribution strategies: row blocked, column blocked and square
blocked


Due to the row blocked distribution, the results along columns have to be sent
to the neighboring processors. The parallel implementation of the algorithm
should allow each processor to start any of the four basic operations as soon
as it has the data to start, avoiding the idle state. Figure 6 shows an
optimized timing diagram for 3 processors, where processing starts as soon as
the data is available and processors give priority to the operation results
that others may be waiting for; in this case the priority is given to
operations along columns.


Fig. 6. Timing diagram for 3 processors (P1, P2 and P3), showing processing and
communicating phases

From measurements made in tests of the parallel algorithms, the parameters that
characterize them were obtained, as shown in Table 1. The interconnection
network, a Fast Ethernet at 100 Mbit/s, has a transmission time of
0.08 µs/byte. The packing time T_E, of value 0.07 µs per byte, was measured
indirectly.


Algorithm    β^α (Mflop/kb)   k2            b (bytes)     C1 (Mflops)
Edge Det.    1.333            (illegible)   (illegible)   0.8
Histogram    0.150            (illegible)   (illegible)   0.8

Table 1. Algorithm parameters


The edge detection algorithm was run on machine M1, with 68.8 Mflops, the
master process being on an 80 Mflops computer. Substituting all the parameters
into the expression for P, for an image of 64 kb (256 x 256), one gets:

$$P = \sqrt{\frac{64 \times 1.333 / 68.8}{0.8/80 + 0.5\times10^{-3}\left\lceil \dfrac{145 \times 256}{1024} \right\rceil + 145 \times 256\,(0.15\times10^{-6})}} = 6.03 \qquad (11)$$


The serial component due to the parallelization management runs in the master
process, and therefore it is the master speed that divides the constant C_1.


Fig. 7. Processing time of the parallel algorithm in a Parallel Virtual
Computer, for images of 64 kb and 20.25 kb


Figure 7 represents the processing time of the parallel algorithm when executed
on the virtual computer M1 with P varying from 1 to 10 processors, the fastest
ones being chosen in each case. The processing time decreases until P reaches
6; above that number it increases. From these results one concludes that the
ideal number of processors for the 64 kb image is 6, as obtained theoretically.

The second example is for an image of 144 x 144 pixels, 20.25 kb, where the
value obtained for P, for the M1 virtual computer, was 4.07 processors.
Experimentally, as shown in Figure 7, the best performance is obtained with 4
processors. This result again supports the validity of the theoretical model
used.
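
The two theoretical values quoted for the edge detection algorithm can be
reproduced from the expression for P. The sketch below is a reconstruction
under stated assumptions, not the authors' code: the number of bytes exchanged
is taken as b = 145 x image width (read off the worked example for the 256 x 256
image), k2 is b divided by the 1 Kbyte packet size and rounded up, and the image
holds one byte per pixel. All names are illustrative.

```python
import math

T_L = 500e-6                   # latency per packet (s)
PACKET_BYTES = 1024            # packet size (bytes)
PER_BYTE = 0.08e-6 + 0.07e-6   # transmission plus packing time (s/byte)
C1 = 0.8                       # parallelism management constant (Mflop)
BETA_ALPHA = 1.333             # edge detection work (Mflop/kb)
S_EQ = 68.8                    # equivalent capacity of M1 (Mflops)
S_MASTER = 80.0                # computer running the master process (Mflops)


def model_terms(width):
    """Return (useful work time, cost added per processor) for a square image."""
    n_kb = width * width / 1024        # image size in kb, one byte per pixel
    b = 145 * width                    # assumption: bytes exchanged with neighbors
    k2 = math.ceil(b / PACKET_BYTES)   # packets needed for those bytes
    useful = n_kb * BETA_ALPHA / S_EQ
    per_processor = C1 / S_MASTER + k2 * T_L + PER_BYTE * b
    return useful, per_processor


def ideal_processors(width):
    """Closed-form optimum: square root of useful time over per-processor cost."""
    useful, per_processor = model_terms(width)
    return math.sqrt(useful / per_processor)


def best_integer_processors(width, max_p=10):
    """Scan P = 1..max_p over the P-dependent part of the model."""
    useful, per_processor = model_terms(width)
    return min(range(1, max_p + 1),
               key=lambda p: per_processor * p + useful / p)


if __name__ == "__main__":
    for width in (256, 144):
        print(width, round(ideal_processors(width), 2),
              best_integer_processors(width))
```

Under these assumptions the closed form gives about 6.03 and 4.07, and the
integer scan selects 6 and 4 processors, in line with the values reported above.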


5.2  Histogram  algorithm 

Histogram computation is also an extensively used algorithm in image
processing, often as a preprocessing stage of more elaborate algorithms.
Basically, the algorithm consists of counting the occurrences of each pixel
level. The input is an image and the output is a vector of integer values, with
length equal to the number of values that a pixel can assume. For an 8 bit
representation, the length is 2^8 = 256.
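
The counting step, and the scheme used in this section in which each processor
owns one segment of the output vector, can be simulated sequentially. The
sketch below is only an illustration: it runs in a single process, the helper
names are hypothetical, and the row-block split and equal-width segments are
assumptions made for the example.

```python
def histogram(pixels, levels=256):
    """Count the occurrences of each pixel level."""
    counts = [0] * levels
    for value in pixels:
        counts[value] += 1
    return counts


def split_rows(rows, workers):
    """Split the image into contiguous row blocks, one per worker."""
    base, extra = divmod(len(rows), workers)
    blocks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


def segmented_histogram(rows, workers, levels=256):
    """Each worker counts its own rows, then sums one segment of all partials."""
    partials = [histogram([v for row in block for v in row], levels)
                for block in split_rows(rows, workers)]
    seg = -(-levels // workers)  # segment length, rounded up
    result = []
    for w in range(workers):
        lo, hi = w * seg, min((w + 1) * seg, levels)
        # Worker w collects entries lo..hi-1 from every partial histogram.
        result.extend(sum(p[i] for p in partials) for i in range(lo, hi))
    return result


if __name__ == "__main__":
    image = [[(7 * r + 3 * c) % 256 for c in range(16)] for r in range(16)]
    flat = [v for row in image for v in row]
    print(segmented_histogram(image, 3) == histogram(flat))
```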

Each processor computes a segment of the histogram, for which it has to collect
that segment from the other processors. Therefore, the amount of data to be
exchanged is independent of the distribution used, although it depends on the
number of processors, since each processor has to receive from the other P - 1
processors the histogram segment that it is assigned to compute. As shown in
Table 1, the amount of data to exchange, b, is given as a function of P. This
changes the expression used to compute P, the ideal number of processors, which
now becomes:


$$T(n,P) = \frac{C_1 P}{S} + \frac{n}{P}\left(\frac{1}{W}+T_E\right)P + k_1 T_L + \left(\frac{1}{W}+T_E\right)\frac{N}{P}(P-1)P + k_2 T_L (P-1)P + \frac{h}{P}\left(\frac{1}{W}+T_E\right)P + k_3 T_L + \frac{n\beta^{\alpha}}{PS} \qquad (12)$$

$$\frac{dT}{dP} = \frac{C_1}{S} + \left(\frac{1}{W}+T_E\right)N + 2k_2 T_L P - k_2 T_L - \frac{n\beta^{\alpha}}{P^2 S} = 0 \qquad (13)$$

Multiplying by P^2 gives

$$2 T_L k_2 P^3 + \left(\frac{C_1}{S} + \left(\frac{1}{W}+T_E\right)N - k_2 T_L\right)P^2 - \frac{n\beta^{\alpha}}{S} = 0$$

The equation is a polynomial in P of degree 3. Assuming an 8 bit representation
for image pixels, N takes the value 256. The value of n is the image size in
kb. The network parameters take the same values as before. Since the histogram
vector can be sent in a single 1024 byte packet, k_2 equals 1. Substituting
these values into the above expression, one gets:

$$1.0\times10^{-3} P^3 + \left(\frac{C_1}{S} + 256 \times 0.15\times10^{-6} - 0.5\times10^{-3}\right)P^2 - \frac{n\beta^{\alpha}}{S} = 0$$

Figure 8 shows the theoretical number of processors that gives the best
speedup, for the cases where the master process runs on an 80 Mflops computer
or on a 161 Mflops one. The machine response time depends on the speed of the
computer that runs the master process, since the constant C_1 is divided by its
speed. For images of 64 kb and 256 kb the optimum value of P is, respectively,
2.4 and 4.5 for an 80 Mflops master computer. For a 161 Mflops master computer,
the values are 3.1 and 5.4 processors.

Fig. 8. Solution of the polynomial equation for images of 64 kb and 256 kb

Figure 9 shows the processing times measured with the virtual machine M2 for
images of 64 kb and 256 kb. As obtained theoretically, the optimum values found
in practice for P were, respectively, 2 and 4 when the master process runs on
the 80 Mflops computer. When the master process is executed on the 161 Mflops
computer, the optimum measured values are 2 and 5 processors. The value P = 5
corresponds to the theoretical value for the 256 kb image; however, for the
64 kb image, the theoretical value is 3, as opposed to the value 2 found in
practice. This discrepancy is in fact not very significant, as the measured
processing times for 2 and 3 processors differ only slightly, as shown in
Figure 9.
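
The cubic in P for the histogram algorithm has a single positive root, which
can be found by bisection. The sketch below is a reconstruction under stated
assumptions rather than the authors' procedure: the S dividing the useful work
is taken as the equivalent capacity of M2 (129.7 Mflops), the S dividing C1 is
the speed of the master computer, and the coefficients are 2 TL k2 for the cubic
term and C1/S + (1/W + TE) N - k2 TL for the quadratic term. Names are
illustrative.

```python
T_L = 500e-6        # latency per packet (s)
PER_BYTE = 0.15e-6  # transmission plus packing time (s/byte)
N_LEVELS = 256      # histogram length for 8 bit pixels
K2 = 1              # the histogram vector fits in one packet
C1 = 0.8            # parallelism management constant (Mflop)
BETA_ALPHA = 0.150  # histogram work (Mflop/kb)
S_EQ = 129.7        # assumption: equivalent capacity of M2 (Mflops)


def cubic(p, n_kb, s_master):
    """Left-hand side of the degree 3 polynomial in P."""
    a3 = 2 * T_L * K2
    a2 = C1 / s_master + PER_BYTE * N_LEVELS - K2 * T_L
    return a3 * p ** 3 + a2 * p ** 2 - n_kb * BETA_ALPHA / S_EQ


def optimum_processors(n_kb, s_master, lo=0.1, hi=64.0, iterations=60):
    """Bisection; valid here because the cubic increases with P on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if cubic(mid, n_kb, s_master) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for s_master in (80.0, 161.0):
        for n_kb in (64, 256):
            print(s_master, n_kb, round(optimum_processors(n_kb, s_master), 2))
```

Under these assumptions the roots fall within a tenth of a processor of the
quoted 2.4, 4.5, 3.1 and 5.4, matching them when truncated to one decimal place.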

Additionally, the discrepancy can be explained by the fact that the processor
running the master process also runs an instance of the slave process; as a
consequence, the capacity of this processor available to the slave process is
reduced, and for a lower capacity the optimum theoretical number of processors
would decrease.

Figure 9 also shows an important feature of the parallel virtual computer,
namely the advantage of using the parallel version of the algorithm even when
the user opts to launch a single slave process. When the user is logged on a
slow machine, the serial version of the algorithm uses that machine for the
whole workload; if the parallel version is selected, the master process runs on
the slow machine but the slave process is assigned to the fastest available
computer, thus reducing the global processing time. This is confirmed by the
measurements displayed in Figure 9 for the 80 Mflops curves: P = 0 corresponds
to the serial algorithm and P = 1 to the parallel one with a single slave.
Notice also that when the user is logged on the fastest machine available
(161 Mflops curves), the serial version is naturally faster, as the parallel
one incurs communication overheads that are unnecessary with a single slave
process.


5.3  Load  Balancing 

In a virtual parallel computer there are frequently machines of many different
processing capacities; therefore, the load distribution should be managed so as
to assign more work to faster processors, so that all processors finish at
about the same time. A test with machine M2 and the edge detection algorithm
was carried out to verify the validity of the strategy suggested above. Since
the computational load is evenly distributed over the domain, it is
straightforward to make the size of each row block proportional to the
processor capacity. Figure 10 shows the response time and the time to process
the edge detection algorithm kernel for each processor, for two situations: 3
and 4 processors working. Processors P1, P2, P3 and P4 have processing
capacities of 161, 161, 105 and 91 Mflops, respectively. The master process
runs on processor P1.


Fig. 9. Processing time of the histogram algorithm for images of 64 kb and
256 kb


Fig. 10. Response time and work done by each processor (time in ms)

From Figure 10 it turns out that the load is well distributed; that is, a
balanced workload was obtained even for widely different processing speeds of
the Virtual Computer nodes. For algorithms where the load is not so evenly
distributed over the domain, a static implementation of load balancing requires
the additional work of partitioning the domain into blocks of similar workload.
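
Making each row block proportional to processor capacity requires rounding to
whole rows. The sketch below shows one way to do it, handing leftover rows to
the largest fractional parts; this rounding rule and the 256-row image are
assumptions for the example, since the text does not state how the rounding was
done or which image size was used in the test.

```python
def row_blocks(total_rows, capacities):
    """Rows per processor, proportional to capacity and summing to total_rows."""
    total = sum(capacities)
    exact = [total_rows * s / total for s in capacities]
    rows = [int(x) for x in exact]  # round down first
    leftover = total_rows - sum(rows)
    # Assumed rule: give the remaining rows to the largest fractional parts.
    order = sorted(range(len(capacities)),
                   key=lambda i: exact[i] - rows[i], reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows


if __name__ == "__main__":
    print(row_blocks(256, [161, 161, 105, 91]))  # P1..P4 of the test
    print(row_blocks(256, [161, 161, 105]))      # P1..P3 of the test
```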

6  Conclusions 

It was shown that, to obtain good performance from the Parallel Virtual
Computer, the algorithm parameters must be known in order to compute the
correct number of processors to use for its execution on the virtual computer.
A methodology to obtain this number was presented, together with test results
that showed its satisfactory accuracy.

The results suggest that a slow processor can be used to log on the user
(and to run the master process), while also contributing some computation
according to its capacity. This type of network allows a staged upgrade, since
adding a fast computer to the network has a direct and positive impact on the
global performance, no matter which processor is used to launch the algorithm.


Abstract. This paper describes a methodology suitable for the behavioural
analysis of parallel real-time and embedded systems. The main goal of the
methodology is to achieve a proper configuration of the system in order to
fulfill the real-time constraints specified for it. The analysis is based on
the measurement of a prototype of the system and is supported by a behavioural
model. The main components of this model are known as “macro-activities”, that
is, the sequences of activities which are carried out in response to input
events, causing the corresponding output events. This provides a behavioural
view in the analysis that complements the more usual structural and resource
views. The methodology incorporates steps of diagnosis (evaluation of the
causes of system behaviour) and configuration (planning of alternatives for
design improvement after diagnosis). The experimental results of applying the
methodology to the analysis of a well-known case study are also an important
part of this paper.


1  Introduction 

The motivation of this work comes from the lack of research works that jointly address
the following three aspects of system analysis: 1) development of a methodology of
system behavioural analysis; 2) the particular problems of real-time and embedded
systems; and 3) the use of analysis metrics obtained from an event trace after system
execution. Indeed, none of the more relevant research works in the area addresses all
three aspects together. In [5] and [3] only the metric and real-time aspects are
considered, respectively. In [10], [2] and [9] methodological aspects of metric-based
analysis are shown. In [1] a set of metrics for real-time systems is used. Besides,
regarding the analysis metrics, this work defines metrics corresponding to the three
possible system views [8], that is, the behavioural, structural and resource views. The
structural view is a static view of the system and provides information about its
design. The resource view provides dynamic information about resource use and is,
together with the structural view, the more usual view in system behavioural analysis.
The last view, the behavioural view, provides dynamic information about the temporal
behaviour of the system in terms of sequences of activities along system execution.

¹ This research work has been supported by the ESPRIT HPC 8169 project ESCORT.






The rest of the paper is organized as follows: in point 2 the model
used in the behavioural view is presented; in point 3 the main steps and aspects of the
analysis methodology are shown; in point 4 the metrics that support the methodology
and their utility are discussed; in point 5 a well-known case study is analyzed using the
methodology; finally, in point 6 the conclusions and future work are presented. Although
the methodology can be applied to real-time systems in general, this work deals with
parallel real-time embedded systems, hereafter referred to simply as real-time systems (RTS).


2  Behavioural  model 

To describe the temporal behaviour of RTS in terms of events, delays and actions, several
approaches or behavioural models can be employed. A model based on an event-ordering
graph has been selected as the behavioural model for this research work. The
selection was made as a result of its simplicity, wide applicability and suitability for
the proposed methodology. The model is composed of two main elements (see [8]): 1)
Activities, which are represented by a sequence of three events: Ready, when the
activity is ready to start; Begin, when it starts; and End, when it finishes; and 2) A set
of precedence and synchronization relationships defined on the ready and end events,
which establish a partial ordering for a group of activities. These relationships give rise to
the following kinds of activities: sequential activity (SEQ); activity of synchronization
with other macro-activities (SYN); alternative or conditional execution activity (ALT);
replicated activity (REP); and activity executed in parallel with others (PAR). These
kinds of activities are represented in Figure 1.


Fig. 1. Kinds of activities (SEQ, SYN, ALT, REP, PAR)


Fig. 2. RTS development cycle


In the model, three principal times are associated with each one of these activities to
explain its temporal behaviour: Waiting Time (the time between the time-stamps of its
ready and begin events), Service Time (the time between the time-stamps of its begin
and end events) and Response Time (the sum of Waiting and Service Times). The model
represents sequences of activities executed in response to the main input events which
the RTS must deal with. These sequences, called macro-activities in the methodology,






are equivalent to the conventional end-to-end tasks found in related literature.
Construction of the model implies an understanding of the functional and structural models
provided by the RTS design methodology.
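The three times associated with an activity can be illustrated with a minimal Python sketch. The `ActivityEvents` class and the function names are hypothetical and only mirror the definitions given in the text; time-stamps are assumed to be expressed in a common unit.

```python
from dataclasses import dataclass


@dataclass
class ActivityEvents:
    """Time-stamps of the three events of one activity execution (hypothetical type)."""
    ready: float  # the activity is ready to start
    begin: float  # the activity starts
    end: float    # the activity finishes


def waiting_time(a: ActivityEvents) -> float:
    # Time between the ready and begin events.
    return a.begin - a.ready


def service_time(a: ActivityEvents) -> float:
    # Time between the begin and end events.
    return a.end - a.begin


def response_time(a: ActivityEvents) -> float:
    # Sum of the waiting and service times.
    return waiting_time(a) + service_time(a)
```

For an execution with ready, begin and end time-stamps of 0, 2 and 5, the sketch gives a waiting time of 2, a service time of 3 and a response time of 5.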

3  Analysis  methodology 

The goal of the methodology is the analysis of RTS behaviour from the early design
phase to its final implementation. To achieve this, the methodology involves the
construction of a synthetic prototype [4] of the RTS design, which is analyzed and refined
until the validation of the timing requirements is achieved. It also serves as a skeleton
for the implementation phase. In Figure 2 the integration of the methodology in the
RTS development cycle is shown. Two refining cycles can be derived from the
methodology: one working with early designs through prototypes of the RTS, and the other
with the implementation of the RTS. The methodology approach can be summarized in the
following 10 steps:

1. Prototyping of the initial RTS design under analysis.

2.  Understanding  of  the  sequences  of  activities  to  be  executed  as  RTS  responses  to 
events  (macro-activities),  selecting  the  ones  to  be  considered  in  the  analysis,  and 
establishing  a  specification  of  behaviour  for  them. 

3.  Instrumentation  of  the  RTS  software  prototype  to  enable  the  monitoring  system  to 
obtain  information  about  the  behavior  of  macro-activities  during  the  RTS  execution. 

4.  Execution  of  the  instrumented  RTS  under  specific  operational  conditions  (scenario) 
over  a  period  of  time  long  enough  to  obtain  a  representative  event  trace. 

5.  Checking  of  the  fulfillment  of  the  real-time  constraints  defined  in  the  specification 
of  behaviour  for  the  RTS. 

6.  Development  of  a  multi-level  analysis  in  specific  temporal  analysis  windows  based 
on  a  set  of  parameters  and  metrics  derived  from  the  trace,  and  covering  structural, 
behavioural  and  resource  views. 

7. Identification of critical macro-activities which do not fulfill their specifications
of behaviour, and evaluation of the incidence of a set of possible causes of the
behaviour observed in the RTS as a whole, the critical macro-activities and the
critical activities within them (diagnosis of temporal behaviour).

8.  Tuning  of  the  system  design  according  to  the  incidence  of  each  cause  of  behaviour, 
establishing  a  suitable  configuration  of  the  RTS. 

9. Repetition of the analysis cycle until a final prototype which permits timing
validation is obtained.

10.  Implementation  of  RTS  design  and  repetition  of  the  analysis  cycle  until  its  final 
implementation. 

The  main  aspects  of  the  methodology,  highlighted  above  in  bold  face,  are  briefly 
described  in  the  following  points. 

3.1 Specification of behaviour

The specification of behaviour of the RTS consists of specifications of the behaviour
of each macro-activity. These specifications consider load characteristics and real-time
constraints in macro-activities. The most typical macro-activities in an RTS respond to
periodic events and, consequently, are characterized by a periodic execution. These
are called periodic macro-activities. Macro-activities responding to aperiodic events
are called aperiodic macro-activities. The period of activation is the main load
characteristic for periodic macro-activities, while the mean activation period and its
standard deviation characterize aperiodic macro-activities. The real-time constraints
considered for both periodic and aperiodic macro-activities are: the absorption of productivity
and the deadline (end-to-end deadline). The fulfillment of the first constraint implies
the capacity of the RTS to respond to all the input events produced during execution.
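As an illustration of these load characteristics and constraints, the following Python sketch computes the mean activation period and its deviation from a list of activation time-stamps, and checks the two real-time constraints. All function names are hypothetical. The use of the population standard deviation, and the reading of "absorption of productivity" as "at least as many executions as input events", are assumptions of the sketch rather than definitions taken from the paper.

```python
from statistics import mean, pstdev


def activation_stats(activations):
    """Mean activation period and its deviation for one macro-activity.

    `activations` is a sorted list of activation time-stamps (hypothetical input).
    The population standard deviation is an assumption of this sketch.
    """
    periods = [later - earlier for earlier, later in zip(activations, activations[1:])]
    return mean(periods), pstdev(periods)


def absorbs_productivity(n_input_events, n_executions):
    # Assumed reading: every input event produced during execution got a response.
    return n_executions >= n_input_events


def meets_deadline(response_times, deadline):
    # End-to-end deadline: no execution may take longer than the deadline.
    return all(t <= deadline for t in response_times)
```

A periodic macro-activity would show a deviation close to zero, whereas an aperiodic one is characterized by both values.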

3.2  Monitoring  system 

The monitoring system, or simply the monitor, is highly dependent on the target system
for which it is developed. In the context of this research work, a full software monitor
for a multiprocessor based on T9000 transputers was developed [7]. The function of
the monitor is to trace the occurrence of the most relevant software events during an
application execution, and to store information related to them in a set of trace files.
The traced events comprise run-time events (communications, synchronization
operations, I/O operations, etc.), macro-activity events (start) and
activity events (ready, begin and end). The monitor is structured in three main
components: a set of distributed monitoring processes, a collection of instrumentation probes
spread over the application processes, and one instrumentation data structure per
application process. In Figure 3 the instrumentation of one activity is shown. Finally, two
steps are taken in order to improve the quality of measurement: precise synchronization
of the system clocks (error < 10 microsec. with clock resolution = 1 microsec.) and
reduction of the monitor intrusiveness (26-37 microsec./probe) to a minimum.


Fig.  3.  Software  instrumentation 


Fig. 4. Analysis approach


3.3  Multi-level  analysis 

This methodology allows the analysis of the system at three possible levels of
abstraction. The first level considers the analysis of the RTS as a whole. The second level
considers the analysis of each macro-activity of the RTS. Finally, the third level
considers the analysis of each of the activities composing the macro-activities. This multi-level
character of the analysis permits a top-down approach, which is very useful in
explaining the behaviour observed in the RTS. Figure 4 summarizes the steps carried out during the
analysis process. Starting from the event trace obtained by the monitor after RTS
execution, an X-window tool permits the validation of the trace according to the behavioural
model and generates the parameter and metric values. From these values, the diagnosis
process is carried out, checking the fulfillment of the real-time constraints and obtaining
the causes of behaviour of the RTS. Finally, with the causes of behaviour, parameters
and metrics, the configuration process suggests alternatives for design improvement.
According to the multi-level analysis character described above, the methodology
considers three possible analysis windows: RTS Window (a temporal window long
enough to represent all the system behaviour characteristics for the scenario under
analysis), Macro-activity Window (which corresponds to the longest response interval of a
macro-activity within the RTS window) and Activity Window (which corresponds to the
response interval of an activity within the macro-activity window). The RTS window is
obtained from the basic RTS period, which corresponds to the Least Common Multiple
(LCM) of all the periods specified for the periodic macro-activities, as seen in Figure 5.
The number of basic RTS periods in the RTS window is fixed considering factors such
as the transient effect caused by pipelining, the statistical characteristics of the aperiodic
macro-activities, and the variability of the response times in the macro-activities.
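The computation of the basic RTS period can be sketched in a few lines of Python. The function names are hypothetical, the periods are assumed to be integers in a common unit (for example milliseconds), and the number of basic periods in the RTS window is left as an input, since the text fixes it from factors that are not formalized here.

```python
from functools import reduce
from math import gcd


def basic_rts_period(periods):
    """Least Common Multiple of the periods of the periodic macro-activities.

    `periods` is a non-empty list of positive integers (an assumption of this sketch).
    """
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def rts_window_length(periods, n_basic_periods):
    # The RTS window spans a chosen number of basic RTS periods.
    return n_basic_periods * basic_rts_period(periods)
```

For three hypothetical macro-activities with periods of 6, 4 and 2 ms, the basic RTS period is 12 ms, and an RTS window of three basic periods lasts 36 ms.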


Fig. 5. Basic RTS period


Fig. 6. Parameters in methodology


3.4  Parameters  and  metrics 

Parameters summarize all the known information about the RTS before execution. So, while
they give only a static view of the system, they are necessary to determine the influence
of the design components on the behaviour of the system. Four kinds of parameters are
considered in the methodology: load parameters, parameters of the structural model,
parameters of the behavioural model, and parameters of connection between models.
Load parameters define the demands of service on the system from the environment,
and reflect the load characteristics of macro-activities in the RTS behavioural model.
The structural model defines the current design of the RTS and considers components
on four levels or layers: the whole RTS, processors, processes (schedulable units) and
blocks. The blocks represent code sequences supporting the execution of activities. The
behavioural model, on the other hand, considers components on three levels: the RTS
as a whole, macro-activities and activities. Figure 6 shows the levels considered in each
model and the relationships between them. The parameters of the structural and
behavioural models represent mapping relationships between their components in the
levels of the corresponding model. Finally, the parameters of connection between the
models establish the mapping of activities in blocks, providing the basis with which to relate
both models.

Metrics are the criteria used to explain the behaviour observed in the system. They can
be simple measurements obtained from the event trace, or relationships between the
measurements and the parameters. Metrics provide information which feeds the
models corresponding to the behavioural and resource views. The magnitudes employed in
the construction of the metrics corresponding to the behavioural view are the following:
Initialization time of macro-activity (TiniM); Response time of macro-activity (TresM);
Waiting time of macro-activity (TwaiM); Service time of macro-activity (TserM);
Number of executions of macro-activity (NeM); Theoretic number of executions of
macro-activity (NeTM); Deadline of macro-activity (DM); Number of failures of deadline in
macro-activity (NfDM); Response time of activity (TresA); Waiting time of activity
(TwaiA); Service time of activity (TserA); and Number of executions of activity (NeA).

On the other hand, the new magnitudes employed in the construction of the metrics
corresponding to the resource view are the following: Process use of RTS (UproS);
Processor use of RTS (UcpuS); Set of processors use of RTS (UcpusS); Time of processor
use of macro-activity (TcpuM); Communication time of macro-activity (TcomM);
Process use of macro-activity (UproM); Processor use of macro-activity (UcpuM); Set of
processors use of macro-activity (UcpusM); Index of concurrence of macro-activity
(IcM); Time of processor use of activity (TcpuA); Communication time of activity
(TcomA); Process use of activity (UproA); Processor use of activity (UcpuA); Index
of blocking of activity (IbA); Index of parallelism of activity (IpA); and Index of
concurrence of activity (IcA).

The metrics used in this methodology can be classified according to the level of
analysis in which they are applied. So, three different levels can be distinguished: RTS
level metrics, macro-activity level metrics and activity level metrics. For a specific level
of analysis and a specific view, the metrics can be calculated using all three analysis
windows, that is, the RTS window (Wrts), the macro-activity window (Wmac) and
the activity window (Wact). Tables 1 and 2 show all the metrics at activity level
corresponding to the behavioural and resource views, respectively. M and D prefixes in
metrics refer to the mean and deviation values respectively.
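The M and D prefixes can be made concrete with a short Python sketch that derives two activity-level metrics of the behavioural view from the times measured over several executions of one activity. The function name and the input lists are hypothetical, and the deviation is taken here as the population standard deviation, which is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev


def activity_behavioural_metrics(twai_a, tres_a):
    """Two activity-level metrics over a set of executions of one activity.

    `twai_a` and `tres_a` hold the waiting and response times (TwaiA, TresA)
    of each execution inside the analysis window (hypothetical inputs).
    """
    m_twai = mean(twai_a)  # MTwaiA
    m_tres = mean(tres_a)  # MTresA
    d_tres = pstdev(tres_a)  # DTresA, assumed population standard deviation
    return {
        "MTwaiA/MTresA": m_twai / m_tres,  # share of the response time spent waiting
        "DTresA/MTresA": d_tres / m_tres,  # relative variability of the response time
    }
```

With waiting times of 1 and 3 and response times of 4 and 6, the sketch yields MTwaiA/MTresA = 0.4 and DTresA/MTresA = 0.2.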

The Index of blocking (Ib) helps to identify the cause of blocking time in an
activity. This index compares the activity response time with the macro-activity period,
in order to establish whether the blocking time is caused by overlapping of macro-activity
executions (when the sum of the service and communication times is greater than the
macro-activity activation period). Therefore, index values over 1 indicate overlapping.
The Index of parallelism (Ip) provides information about the level of concurrence of
parallel activities (PAR activities) of the behavioural model, in a given execution. The
Index of concurrence (Ic) provides information about the level of concurrence of all the
activities composing the macro-activity in a given execution.

Activity Level (analysis windows: Wrts, Wmac, Wact)

MTwaiA/MTresA   MTresA/MTresM   DTresA/MTresA   NeA/NeM
TwaiA/TresA     TresA/TresM

Table 1. Behavioural view


Activity Level (analysis windows: Wrts, Wmac, Wact)

MTcpuA/MTserA   TcpuA/TserA   DTcpuA/MTcpuA
MTcomA/MTwaiA   TcomA/TwaiA
MIbA            IbA, IpA, IcA
UproA (Wrts, Wmac, Wact)      UcpuA (Wrts, Wmac, Wact)

Table 2. Resource view

3.5  Diagnosis  of  temporal  behaviour 

The  stages  to  follow  in  the  diagnosis  of  temporal  behaviour  are  the  following: 

-  Stage 1: Diagnosis of the RTS. In this first stage, global causes of behaviour of the
set of macro-activities in the behavioural model are evaluated.

-  Stage 2: Identification of critical macro-activities. The objective of this stage is the
identification of macro-activities which do not fulfill one or more of the real-time
constraints defined in the specification of behaviour.

-  Stage 3: Diagnosis of each critical macro-activity. In this stage, the causes of
behaviour which explain the response time of each critical macro-activity are found.

-  Stage 4: Identification of critical activities. The objective of this stage is the
identification of the most significant or critical activities (bottlenecks) in each critical
macro-activity.

-  Stage 5: Diagnosis of each critical activity. In this last stage, the causes of
behaviour which explain the response time of each critical activity are found.

The support provided by parameters and metrics in the diagnosis is shown in
Figure 7. The figure summarizes the parameters and metrics (including the analysis window
considered for them) useful at each diagnosis stage.

Fig. 7. Parameter and metric support in diagnosis

When considering the causes of temporal behaviour in the diagnosis, the differences
between the causes at activity level and the causes at macro-activity and RTS levels must
be clearly established. At activity level, the response time of the critical activities must
be explained, and three levels of diagnosis are considered. The first level of diagnosis
evaluates the waiting and service times of each activity. Evaluation of communication
and blocking times during waiting time, processing during service time and resource
contention during both waiting and service times corresponds to the second level of
diagnosis. Finally, the third level of diagnosis evaluates the specific causes of
communication, processing, blocking or contention. Based on these three diagnosis levels, a
set of causes of temporal behaviour of the critical activities can be established. Figure 8
represents the three levels of diagnosis. In Table 3 all the causes considered, with a brief
description of each, are detailed. The incidence of each cause of behaviour is evaluated
with a value in the range of 0-1. This value of incidence is the product of the partial
incidence values obtained at each level of diagnosis.
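The product rule for the incidence of a cause can be sketched as follows. The function name and the example values are hypothetical; the sketch only assumes that each diagnosis level contributes one partial incidence in the range 0-1.

```python
from math import prod


def incidence(partial_incidences):
    """Incidence of one cause as the product of its partial incidences.

    `partial_incidences` holds one value per diagnosis level, each in [0, 1].
    """
    for value in partial_incidences:
        if not 0.0 <= value <= 1.0:
            raise ValueError("partial incidence values must lie in the range 0-1")
    return prod(partial_incidences)
```

For instance, if waiting accounts for 0.6 of the response time (first level), blocking for 0.5 of the waiting time (second level) and overlapping for 0.4 of the blocking time (third level), the incidence of blocking due to overlapping is 0.6 x 0.5 x 0.4 = 0.12.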

At RTS level, the causes of behaviour of the RTS as a whole must be found, whereas
at macro-activity level, the response time of critical macro-activities must be explained.
To achieve those objectives, the aggregations of macro-activity and activity metrics are
considered, respectively. Therefore it is not possible to distinguish all the causes of
behaviour considered at activity level. Only the first two levels of diagnosis are
considered, and so the causes of behaviour are reduced to five. These causes are the following:
INI (initialization time of macro-activities); WAI_COM (communication during
waiting time); WAI_BLO (blocking during waiting time, contention included); SER_CON
(contention during service time); and SER_PRO (processing during service time).


3.6  Configuration 

Once the incidence of the causes of the behaviour of each critical activity composing
the critical macro-activities has been established, those with high incidence will be
considered, in order to tune the RTS design and establish its proper configuration. The goal
of configuration can be either the fulfillment of the real-time constraints using the
available resources, or the reduction of resources in the RTS design, while maintaining the
fulfillment of the real-time constraints. The proper design alternatives for the causes of
behaviour of critical activities within the critical macro-activities are the following:
Re-mapping of blocks on processes (RBL); Re-mapping of processes on processors (RPR);
Change of process priority (CPR); Segmentation of an activity (SEG); Replication of an
activity (REP); Parallelization of an activity (PAR); Balance of load in PAR activities
(BAL); and Optimization of block code (OPT).
Cause  Description

       Communication with other activities of the same Macro-activity during the Waiting time of the activity
WCS    Communication for Synchronization with other macro-activities during the Waiting time
WTB    ConTention of CPU due to competition with other activities supported by the same Block during the Waiting time
       ConTention of CPU due to competition with activities supported by other blocks of the same Process
WTH    ConTention of CPU due to competition with Higher priority activities during the Waiting time
WTE    ConTention of CPU due to competition with activities of Equal priority during the Waiting time
       Blocking during the Waiting time due to execution Overlapping in the macro-activity
       Blocking during the Waiting time due to synchronization with other macro-activities
       Blocking during the Waiting time of a PAR activity caused by the previous synchronization of other PAR activities
       Blocking during the Waiting time due to other overheads associated with Communication
       ConTention of CPU due to competition with Higher priority activities during the Service time of the activity
       ConTention of CPU due to competition with activities of Equal priority during the Service time
       Load imbalance of a PAR activity during the Processing part of the Service time
       Execution of code during the Processing part of the Service time

Table 3. Causes of behaviour: Activity level


4  Case  study 

The case study considered here to show the use of the analysis methodology and
described below has been widely studied in other papers, such as [6].

A Remote Speed Sensor (RSS) measures the speed of a number of motors, and
reports them to a remote host computer. The speed of each motor is obtained by
periodically reading a corresponding digital tachometer. The interval between speed readings
(10-1000 ms) for each motor is specified by the host computer. An Analog-to-Digital
Converter (ADC) with a set of multiplexed channels is used to measure the speed signal
provided by tachometers coupled to the motors. The ADC accepts reading-requests in
the form of motor numbers (integers in the range of 0-15). After a request has been
received, the converter reads the speed of the motor, stores it in a hardware buffer, and
generates an interrupt. The converter can only read the speed of one motor at a time.
The interval between readings for a given motor is specified in a control packet which
is sent from the host computer to the RSS. The speed of a motor is reported to the host
via a data packet. When a control packet is received from the host, it is checked for
validity. If the message is valid, an acknowledgment (ACK) is sent to the host. If it is
not, a negative acknowledgment (NAK) is sent. When a data packet is sent, the RSS
waits to receive either an ACK or a NAK from the host. If a NAK is received, or neither
an ACK nor a NAK is received within half the reading interval, the message emission
is marked as a failure.
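The acknowledgment rule for data packets can be sketched as a small Python function. The function name, the string encoding of the replies and the use of `None` for "no reply" are hypothetical; the sketch also assumes that a reply arriving exactly at half the reading interval still counts as received in time.

```python
def emission_outcome(reply, reply_delay_ms, reading_interval_ms):
    """Classify the emission of one data packet from the RSS to the host.

    `reply` is "ACK", "NAK" or None (no reply at all); `reply_delay_ms` is the
    delay of the reply, or None when there is no reply (hypothetical encoding).
    """
    timeout_ms = reading_interval_ms / 2  # half the reading interval
    replied_in_time = reply is not None and reply_delay_ms <= timeout_ms
    if replied_in_time and reply == "ACK":
        return "ok"
    # A NAK, a late reply or no reply at all marks the emission as a failure.
    return "failure"
```

With a reading interval of 10 ms, an ACK after 4 ms is accepted, whereas a NAK, an ACK after 6 ms, or no reply at all each mark the emission as a failure.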


4.1  Design  structure 

The design structure of the case study has four principal parts: the tacometer, the
motors, the host interface and the host. Each of these parts is composed of one or more
software processes, as seen in Figure 9. The tacometer process can access the speeds
of all the motors and is constantly waiting for reading-requests. When a request is received,
it reads the corresponding speed and sends the data to the motor process. Motor
processes, one per motor, periodically send the reading-requests to the tacometer process.
Once data is received, it is filtered and sent to the host. These processes have associated
processes which inform them about new reading intervals requested from the host. The
host interface is implemented with four processes: inport, outport, inmsg and outmsg.
Inport receives packets from the host. If the packet is a control packet, it is sent to inmsg.
If it is an ACK or a NAK, it is sent to outmsg. Outport receives packets from outmsg
and inmsg and sends them to the host. Inmsg receives control packets from inport. If
the packet is not valid, it sends a NAK to the host through outport. If it is valid, it sends
an ACK to the host through outport and the new interval to the motor process. Outmsg
receives speeds from the motor processes and ACKs and NAKs from inport. All of them
are sent to the host. The host has three processes. Phost is the main process and phostin
and phostout are input and output processes for communication with their
corresponding processes in the RSS host interface. The embedded system was implemented on a
PARSYS SN9500 machine, a distributed memory parallel machine based on
Transputers with 8 CPUs. In this machine the process communications are established through a
virtual channel network. RSS processes were implemented over 4 CPUs. For simplicity
in the case study, host processes were also implemented in the same machine using an
extra CPU.
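The routing performed by the host interface processes can be sketched as two small Python functions. Process names follow the text, while the packet encoding, the function names and the returned tuples are hypothetical and serve only to illustrate the flow of packets.

```python
def inport_route(packet_kind):
    """Destination process of a packet received by inport from the host."""
    if packet_kind == "CONTROL":
        return "inmsg"
    if packet_kind in ("ACK", "NAK"):
        return "outmsg"
    raise ValueError(f"unknown packet kind: {packet_kind}")


def inmsg_handle(is_valid, new_interval_ms):
    """Messages emitted by inmsg for one control packet, as (destination, payload)."""
    if not is_valid:
        # Invalid control packet: only a NAK goes to the host through outport.
        return [("outport", "NAK")]
    # Valid control packet: ACK to the host and the new interval to the motor process.
    return [("outport", "ACK"), ("motor", new_interval_ms)]
```

A valid control packet that sets an interval of 20 ms thus produces an ACK for outport and the value 20 for the motor process.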


4.2  Behavioural  model  and  real-time  constraints 

According to the methodology, the relevant action-reaction event couples which
demand system response must first be identified. The action events are the periodic
requests for speed-reading from the motor processes to the tacometer process. The
reaction events are the arrivals at the outmsg process of ACKs coming from the host in response
to the emission of packets with speed data. A total of 16 kinds of events, one per motor,
are considered. The macro-activities include all the activities executed from the
speed-reading request to the ACK reception from the host. All 16 macro-activities considered
in the analysis have the same structure, as shown in Figure 10, corresponding to the
RTS behavioural model.

The real-time constraints of each macro-activity in the case study depend on the
speed of the corresponding motor. The deadline of each macro-activity has the same
value as its period of execution, as shown in Table 5, where Mac. refers to the
macro-activity number, and times are given in milliseconds.


Fig. 9. Process Structure


Fig. 10. Behavioural model


CPU       Processes                        Activities                          Nblocks
T9000[0]  phost, phostin, phostout         [i,5], [i,4], [i,6]  i=0-15         1, 1, 1
T9000[1]  inport, outport, inmsg, outmsg   [i,7], [i,3], [i,2], [i,8]  i=0-15
T9000[2]  motor[i]                         [i,1]  i=0,2,... (even)             8
T9000[3]  motor[i]                         [i,1]  i=1,3,... (odd)              8
T9000[4]  tacometer                        [i,0]  i=0-15                       1

Table 4. Parameters


Mac.  Period  DM      Mac.  Period  DM
0     12      12      8     80      80
1     20      20      9     100     100
2     25      25      10    120     120
3     30      30      11    150     150
4     40      40      12    200     200
5     50      50      13    700     700
6     60      60      14
7     75      75      15

Table 5. Real-time constraints


4.3  Analysis 

Table 4 summarizes the main parameters: parameters of the structural model, parameters
of the behavioural model and parameters of connection between the models. It shows
the mapping of processes to processors, the mapping of activities to processes and
the number of blocks supporting the activities (Nblocks). Index i refers to
macro-activities; activities [i,1] are the only ones supported by several independent
blocks and processes.

Diagnosis

Stage 1. Table 6 shows some of the RTS level metrics in the RTS analysis window.
A significant synchronization load can be observed from the relative value of Twai.
A very low rate of deadline failures and a medium level of use of the set of
processors are also observed. Finally, Table 7 shows the incidence of the causes of
behaviour for the RTS as a whole. Here, blocking is the most important cause of
behaviour.


Metric               Value
-------------------  -----
M(MTiniM/MTresM)     0.02?
M(MTwaiM/MTresM)     0.606
M(MTresM/DM)         0.21?
M(NeM/NeTM)          1.000
M(NfDM/NeM)          0.001
M(MTcomM/MTwaiM)     0.183
M(MTcpuM/MTserM)     0.778
UcpusS               0.506

(? marks a digit that is illegible in the source.)

Table  6.  RTS  metrics 


Cause     Incidence
--------  ---------
INI       0.03
WAI.COM   0.11
WAI.BLO   0.50
SER.CON   0.08
SER.PRO   0.29

Table  7.  RTS  diagnosis 


Stage 2. Figure 11 shows the relative value of macro-activity response times with
regard to their deadlines. Only macro-activity 0 in its macro-activity window exceeds
its deadline. The remaining metrics establish that the constraint of productivity
absorption is fulfilled in all macro-activities and that the deadline constraint is
not fulfilled in macro-activity 0. So, macro-activity 0 is selected as the critical
macro-activity.
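
The selection rule of Stage 2 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the deadlines are the first three values of the real-time constraints table, and the response times are invented for the example, not measurements from the case study.

```python
def critical_macro_activities(response_ms, deadline_ms):
    """Return (index, ratio) for every macro-activity whose mean response
    time exceeds its deadline, i.e. whose ratio response/deadline is > 1."""
    ratios = [r / d for r, d in zip(response_ms, deadline_ms)]
    return [(k, ratio) for k, ratio in enumerate(ratios) if ratio > 1.0]


# Deadlines (ms) of macro-activities 0, 1 and 2.
deadlines = [12, 20, 25]
# Hypothetical mean response times (ms), for illustration only.
responses = [13.5, 6.0, 7.0]

critical = critical_macro_activities(responses, deadlines)
```

With these assumed numbers only macro-activity 0 is reported, which mirrors the outcome described in the text.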


Fig.  11.  Macro-activity  level  metrics  Fig.  12.  Activity  level  metrics 


Stage 3. Processing (0.52 incidence), blocking (0.29 incidence) and communication
(0.11 incidence) turn out to be the principal causes of behaviour in the
macro-activity window for critical macro-activity 0.

Table 8. Diagnosis of behaviour for activity 1


Stage 4. The activity level metrics considered here correspond to the activities
composing the critical macro-activity 0. Figure 12 shows the relative value of each
activity response time with respect to the macro-activity response time in the
activity and RTS windows. Activity 1 is seen to be the longest activity, with nearly
40% incidence in the macro-activity response time. So, activity 1 is selected as the
critical activity.

Stage  5.  Table  8  shows  the  incidence  of  the  causes  of  the  behaviour  in  the  activity 
window  of  critical  activity  1.  It  also  shows  the  partial  incidence  of  each  diagnosis  level. 
The  main  causes  of  behaviour  are:  contention  with  other  activities  of  equal  priority 
(the  corresponding  activities  [i,l]  in  the  other  macro-activities)  supported  by  different 
processes;  and  execution  of  code.  Both  causes  correspond  to  the  service  time. 

Configuration. The parameters show that critical activity 1 of critical macro-activity
0 is placed in CPU2. Checking the use of this CPU by each macro-activity in the
critical activity window, macro-activities 2 and 10 can be observed in strong
competition with macro-activity 0. A new mapping with motor processes motor[2] and
motor[10] in CPU2 eliminates deadline failures in all macro-activities.

5  Conclusions and future work

An analysis methodology applicable to the configuration of parallel, real-time and
embedded systems has been presented. The methodology is based on temporal behaviour
analysis of the system and considers a behavioural model of the RTS composed of
real-time macro-activities responding to the input events. The methodology involves
the construction of a synthetic prototype for the initial design of the RTS, which is
refined until the final implementation. Configuration of the RTS is achieved after a
prior diagnosis of its temporal behaviour, based on a set of parameters and metrics
covering three complementary views of the system: behavioural view, structural view
and resource view. A well known case study has been analyzed with the methodology,
demonstrating its possibilities for the configuration of RTS.

Future  work  has  two  main  objectives.  Firstly,  to  derive  automatic  rules  for  proper 
configuration  of  systems  from  the  expertise  gained  with  the  use  of  the  methodology. 
Secondly,  to  apply  the  methodology  to  real-time  POSIX  applications  implemented  with 
either  parallel  or  distributed  architectures. 
Abstract. For more than 20 years, a great deal of research has been carried out
to design image synthesis algorithms able to produce photorealistic images.
To reach this level of perfection, it is necessary to use both a geometrical
model which accurately represents the existing scene to be rendered and a
light propagation model which simulates the propagation of light in the
environment. These two requirements imply the use of high performance
computers which provide both a huge amount of memory to store the
geometrical model and fast processing elements for the computation of the
light propagation model. Moreover, parallel computing is the only available
technology which satisfies these two requirements. Since 1985, several
technology tracks have been investigated to design efficient parallel
computers. This variety forced designers of parallel algorithms for image
synthesis to study several strategies. This paper presents these different
parallelisation strategies for two well known computer graphics techniques:
ray-tracing and radiosity.


Keywords:  parallel  rendering,  ray-tracing,  radiosity 


1  Introduction 

Since the beginning of the last decade, a lot of research has been devoted to
designing fast and efficient rendering algorithms for producing photorealistic
images. These efforts aimed both at more realistic light propagation models
and at reducing algorithmic complexity. They were driven by the need to
produce high quality images of scenes made of several million polygons.
Despite these efforts and the increasing performance of new microprocessors,
computation times remain at an unacceptable level. Using both a realistic light
propagation model and an accurate geometrical model requires huge computing
resources, in terms of both computing power and memory. Only parallel
computers can provide such resources to produce images in a reasonable time
frame.

For more than 10 years, the design of parallel computers has been in constant
evolution due to the availability of new technologies. During the last decade,
most parallel computers were based on distributed memories, each processor
having its own local memory with its own address space. Such Distributed
Memory Parallel Computers (DMPCs) have to be programmed using a
communication model based on the exchange of messages. Such machines were
either SIMD (Single Instruction Multiple Data) or MIMD (Multiple Instruction
Multiple Data). Among them, one can cite the most recent ones: CM-5, Paragon
XP/S, Meiko CS-2 and IBM SP-2. More recently, a new kind of parallel system
has become available. Scalable Shared Memory Parallel Computers (SMPCs) are
still based on distributed memories but provide a single address space, so that
parallel programming can be performed using shared variables instead of message
passing. The latest SMPCs are the HP/Convex Exemplar and the SGI Origin 2000.

Designing parallel rendering algorithms for such varied machines is not a
simple task since, to be efficient, algorithms have to take advantage of the
specific features of each of these parallel machines. For instance, the
availability of a single address space may greatly simplify the design of a
parallel rendering algorithm. This paper aims at presenting different
parallelisation strategies for two well known rendering techniques: ray-tracing
and radiosity. These two techniques address two different problems that arise
when realistic images have to be generated. Ray-tracing is able to take into
account direct light sources, transparency and specular effects. However, this
technique does not take into account indirect light coming from the objects
belonging to the scene to be rendered. This problem is addressed by the
radiosity technique, which is able to simulate one of the most important forms
of illumination: the indirect ambient illumination provided by light reflected
among the many diffuse surfaces that typically make up an environment.

The paper is organised as follows. The next section gives some insights into
parallelisation techniques. Section 3 gives an overview of techniques for the
parallelisation of both ray-tracing and radiosity algorithms. Section 4 briefly
describes a technique we designed to enhance data locality for a progressive
radiosity algorithm. Conclusions are presented in section 5.

2  Parallelisation  techniques 

Using parallel computers requires the parallelisation of algorithms prior to their
execution. Due to the different technology tracks followed by parallel computer
designers, such parallel algorithms have to deal with different parallel
programming paradigms (message-passing or shared variables). However, although
the paradigms are different, designing efficient parallel algorithms always
requires paying attention to data locality and load-balancing issues. Exploiting
data locality aims at reducing communication between processors. Such
communication can be either message-passing when there are several disjoint
address spaces or remote memory access when a global address space is provided.
Data locality can be exploited in different ways.

A first approach consists in distributing data among processors, followed by the
distribution of computations. This latter distribution has to be performed in such
a way that the computations assigned to a processor access, as much as possible,
the local data which have been previously distributed. It consists in partitioning
the data domain of the algorithm into sub-domains. Each of them is associated with
a processor. Computations are assigned to the processor which owns the data
involved in these computations. If the computations generate new work that
requires data not located in the processor, this work is sent to the relevant
processors by means of messages or remote memory accesses. This approach is
called data oriented parallelisation.
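
The owner-computes idea behind data oriented parallelisation can be illustrated with a small, purely sequential Python sketch. The modulo ownership rule, the function names and the way work items are generated are assumptions made for the example; they are not taken from any of the surveyed systems.

```python
from collections import defaultdict


def owner(item, n_procs):
    """Toy partition of the data domain: item k lives on processor k % n_procs."""
    return item % n_procs


def run_data_oriented(tasks, n_procs):
    """tasks is a list of (data_item, payload) pairs; task number t is assumed
    to be generated on processor t % n_procs. Each task is forwarded to the
    processor that owns its data item. Returns the per-processor work lists
    and the number of tasks that had to be sent to another processor."""
    queues = defaultdict(list)
    messages = 0
    for t, (item, payload) in enumerate(tasks):
        src = t % n_procs            # where the work was generated
        dst = owner(item, n_procs)   # where the required data lives
        if dst != src:
            messages += 1            # work must travel to the data
        queues[dst].append(payload)
    return dict(queues), messages
```

The message counter makes the cost of a poor match between computation and data placement visible: the fewer tasks land on a processor that does not own their data, the less communication is needed.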

The second approach consists in distributing computations to processors in
such a way that the computations assigned to a processor reuse the same data as
much as possible. This approach requires a global address space since data have
to be fetched and cached in the local processor. It can be applied to the
parallelisation of loops. Loops are analysed in order to discover
dependencies. A set of tasks is then created, each task representing a subset
of iterations. This approach is called control oriented parallelisation.
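
As a minimal sketch of control oriented parallelisation, the Python fragment below cuts a loop into tasks of consecutive iterations and hands them to a pool of workers that all see the same address space. It assumes that dependency analysis has already shown the iterations to be independent; the names `make_tasks` and `run_loop` and the chunk size are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(n_iterations, chunk):
    """Cut the iteration space 0..n_iterations-1 into consecutive chunks."""
    return [range(start, min(start + chunk, n_iterations))
            for start in range(0, n_iterations, chunk)]


def run_loop(body, n_iterations, chunk=4, workers=2):
    """Execute body(i) for every iteration, one task per chunk.
    The iterations are assumed to be independent of each other."""
    def run_task(iterations):
        return [body(i) for i in iterations]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(run_task, make_tasks(n_iterations, chunk)))
    # pool.map preserves task order, so results come back in loop order.
    return [value for part in parts for value in part]
```

Grouping consecutive iterations into one task is what gives each worker a chance to reuse the data it has already fetched.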

The last approach, also called systolic parallelisation, breaks down
the algorithm into a set of tasks, each one being associated with one processor.
The data are then passed from processor to processor. A simplified form of
systolic parallelisation is the well known pipelining technique.
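
The pipelining idea can be emulated sequentially with Python generators: each stage stands for one processor, and items flow from stage to stage. This is a sketch of the data movement only, with hypothetical names; it does not run the stages concurrently.

```python
def stage(func, upstream):
    """One pipeline stage: apply func to every item received from upstream."""
    for item in upstream:
        yield func(item)


def pipeline(data, funcs):
    """Chain the stages in order; each item visits every stage exactly once."""
    stream = iter(data)
    for func in funcs:
        stream = stage(func, stream)
    return list(stream)
```

For example, `pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])` passes each value through an increment stage and then a doubling stage.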

For DMPCs, the physical distribution of processing elements makes data
oriented parallelisation the natural choice. However, to be efficient, this
technique has to be applied to algorithms where the relationship between
computation and data accesses is known. If this relationship is unknown,
achieving load balancing is a tough problem to solve. For SMPCs, the availability
of a global shared address space makes this task easier. When a processor does
not have enough computations to perform, it can synchronise with other
processors to get more computations. However, this approach suffers from an
increasing number of communications, since the idle processor has to fetch the
data used by these new computations. A tradeoff often has to be found to get
the maximum performance out of the machine.

3  Parallel  rendering 

3.1  Ray-tracing 

Principle The ray tracing algorithm is used in computer graphics for rendering
high quality images. It is based on simple optical laws which take effects such as
shading, reflection and refraction into account. It acts as a light probe, following
light rays in the reverse direction. The basic operation consists in tracing a ray
from an origin point in a given direction in order to evaluate a light contribution.
Computing realistic images requires the evaluation of several million light
contributions to a scene described by several hundred thousand objects. This large
number of ray/object intersections makes ray tracing a very expensive method.
Several techniques have been proposed to reduce this number. They are based on
an object access data structure which allows a fast search for objects along a ray
path. These data structures are based either on a tree of bounding boxes or on
space subdivision.

Parallelisation strategies Ray tracing is intrinsically parallel since the
evaluation of one pixel is independent of the others. The difficulty in exploiting
this parallelism is to ensure simultaneously that the load is balanced and that the
database is distributed evenly among the memories of the processors. The
parallelisation of such an algorithm raises a classical problem when using
distributed parallel computers: how to ensure both data distribution and load
balancing when no obvious relation between computation and data can be found? This
problem can be illustrated by the following schematic ray tracing algorithm:

for i = 1, xpix do
    for j = 1, ypix do
        pixel[i,j] = Σ(contrib(..., space[f_x(...), f_y(...), f_z(...)], ...))
    done
done

The computation of one pixel is the accumulation of various light contributions
contrib() depending on the lighting model. Their evaluation requires access to a
database which models the scene to be rendered. In the ray tracing algorithm, the
database space comprises both an object access data structure (space subdivision)
and the objects themselves. The data accesses entail the evaluation of functions
f_x, f_y and f_z. These functions are known only during the execution of the ray
tracing algorithm and depend on the ray paths. Therefore, relationships between
computation and data are unknown.

The parallelisation strategies explained in section 2 can be applied to
ray-tracing in the following manner. A data oriented parallelisation approach
consists in distributing geometrical objects and their associated data structures
(tree of extents or space subdivision) among the local memories of a DMPC. Each
processor is assigned one part of the whole database. There are mainly two
techniques for distributing the database, depending on the object access data
structure which is chosen. The first one partitions the scene according to a tree
of extents while the second subdivides the scene extent into 3D regions (or
voxels). Rays are communicated as soon as they leave the region associated with
one processor. From now on, this technique will be named processing with ray
dataflow.
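
Processing with ray dataflow can be pictured with a deliberately simplified Python sketch: the scene is reduced to a one-dimensional row of voxels, contiguous blocks of voxels are owned by successive processors, and a ray is handed over each time it enters a voxel owned by another processor. The block mapping and the function names are assumptions made for the illustration.

```python
def voxel_owner(voxel, voxels_per_proc):
    """Block distribution: consecutive voxels belong to the same processor."""
    return voxel // voxels_per_proc


def trace_ray(start_voxel, direction, n_voxels, voxels_per_proc):
    """Walk a ray voxel by voxel along one axis (direction is +1 or -1).
    Returns the sequence of processors that handle the ray and the number
    of times the ray had to be communicated to another processor."""
    processors = []
    voxel = start_voxel
    while 0 <= voxel < n_voxels:
        proc = voxel_owner(voxel, voxels_per_proc)
        if not processors or processors[-1] != proc:
            processors.append(proc)   # the ray enters a new processor's region
        voxel += direction
    return processors, len(processors) - 1
```

Because the path of a ray is only known while it is being traced, the number of handovers, and hence the load on each processor, cannot be predicted when the voxels are distributed.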

The control oriented parallelisation consists in distributing the two nested
loops of the ray-tracing algorithm shown previously. This technique may be
applied to DMPCs but requires the duplication of the entire object data
structure in the local memory of each processor. In that case, there is no dataflow
between processors since each of them has the whole database. Pixels are
distributed to processors using a master/slave approach. However, the limited size
of the local memory associated with each processor of a DMPC prohibits its use
for rendering complex scenes. A more realistic approach consists in emulating
a shared memory when using a DMPC or in choosing an SMPC which provides a
single address space. The whole database is stored in the shared memory and
accessed whenever it is needed. As said in section 2, this technique relies mainly
on the exploitation of data locality. Ray-tracing has this property. Indeed, two
rays shot from the observer through two adjacent pixels have a high probability
of intersecting the same objects. This property also holds for all the rays
spawned from the two primary rays. It can be exploited to limit the number of
remote accesses when computing pixels. Those algorithms use a scheme of
processing with object dataflow. Both data oriented and control oriented
approaches can be used simultaneously. Such a hybrid approach has been
investigated recently to provide a better load balance for the design of a
scalable and efficient parallel implementation of ray-tracing. Concerning systolic
oriented parallelisation, several studies have been carried out but the irregular
nature of ray-tracing algorithms makes this approach ineffective. Table 1 gives
a list of references dealing with the parallelisation of ray-tracing, classified
by parallelisation strategy.
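
The benefit of pixel coherence for processing with object dataflow can be illustrated with a toy Python model. Each processor keeps a small least-recently-used cache of scene objects, and every miss stands for a remote access. The scene model (`objects_hit`), the cache capacity and all names are assumptions chosen for the example.

```python
from collections import OrderedDict


class ObjectCache:
    """Small LRU cache of scene objects; a miss counts as one remote access.
    The capacity is assumed to be at least 1."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.remote_fetches = 0

    def get(self, obj_id):
        if obj_id in self.entries:
            self.entries.move_to_end(obj_id)           # mark as recently used
        else:
            self.remote_fetches += 1                   # simulated remote read
            self.entries[obj_id] = f"object-{obj_id}"
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)       # evict least recently used
        return self.entries[obj_id]


def objects_hit(pixel):
    """Toy scene: neighbouring pixels intersect almost the same objects."""
    return [pixel // 4, pixel // 4 + 1]


def render(pixels, capacity=2):
    """Process pixels in the given order; return the number of remote accesses."""
    cache = ObjectCache(capacity)
    for pixel in pixels:
        for obj_id in objects_hit(pixel):
            cache.get(obj_id)
    return cache.remote_fetches
```

In this model, processing eight pixels in scan order costs far fewer remote accesses than visiting the same pixels in an interleaved order, which is the reason why adjacent pixels are usually assigned to the same processor.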


Type  of  parallelism 

Communication 

Data  Structure 

References 

Control 

No  dataflow 

Nishimura  et  al.  [29] 

Tree  of  extents 

Bouville  et  al.  [7] 
Naruse  et  al.  [27]

Object  dataflow 

Space  subdivision 

Green  et  al.  [21, 20] 
Badouel  et  al.  [4, 3] 
Keates  et  al.  [24] 

Bounding  volumes 

Data 

Ray  dataflow 

Space  subdivision 

Dippe  et  al.  [16] 
Cleary  et  al.  [12] 

Isler  et  al.  [23] 
Nemoto  et  al.  [28] 
Kobayashi  et  al.  [25] 
Priol  et  al.  [31] 
Caubet  et  al.  [9]

Tree  of  extents 

Salmon  et  al.  [36]
Caspary  et  al.  [8] 

Hybrid 

Object + Ray  dataflow

Space  subdivision 

Reinhard  et  al.  [34] 

Table  1.  Parallel  ray  tracing  algorithms. 


3.2  Radiosity 

Principle Contrary to the ray-tracing technique, the radiosity method does not
produce an image. It is a technique aiming at computing the indirect ambient
illumination provided by inter-reflections of light between diffuse objects. Such
computations are independent of the view direction. Once they have been
performed, the image of the scene is computed by applying Gouraud shading or
ray-tracing. The radiosity method assumes that all surfaces are perfectly
diffuse, i.e. they reflect light with equal radiance in all directions.
The surfaces are subdivided into planar patches for which the radiosity at each
point is assumed to be constant. The radiosity computation can be represented
by this equation:

$$B_i = E_i + \rho_i \sum_{j=1}^{N} F_{ji} B_j$$

where,

-  B_i : exitance of patch i (radiosity);

-  E_i : self-emitted radiosity of patch i;

-  ρ_i : reflectivity of patch i;

-  F_ji : form-factor giving the fraction of the energy leaving patch j that arrives
at patch i;

-  N : number of patches.
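
For a small number of patches, the system above can be solved directly by iterative sweeps. The following Python sketch uses Gauss-Seidel sweeps and follows the index convention of the equation, with `F[j][i]` standing for F_ji. The function name, the fixed number of sweeps and the test values are illustrative assumptions.

```python
def solve_radiosity(E, rho, F, sweeps=50):
    """Gauss-Seidel sweeps for B_i = E_i + rho_i * sum_j F[j][i] * B_j.
    E: self-emitted radiosities, rho: reflectivities,
    F[j][i]: form-factor from patch j to patch i (F_ji in the text)."""
    n = len(E)
    B = list(E)                      # start from the emitted radiosities
    for _ in range(sweeps):
        for i in range(n):
            gathered = sum(F[j][i] * B[j] for j in range(n))
            B[i] = E[i] + rho[i] * gathered   # uses the latest values of B
    return B
```

The full matrix `F` has to be available before the first sweep, which is exactly the memory cost that motivates the progressive variant.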

The solution of this system is the set of patch radiosities, which provide a
discrete representation of the diffuse shading of the scene. In this equation, the
most important component is the calculation of the form factor F_ji, which gives
the fraction of the energy leaving patch j that arrives at patch i. The computation
of each form factor corresponds to the evaluation of an integral, which represents
the major computation bottleneck of the radiosity method. Form factor
computations are carried out using projection techniques such as the hemi-cube
or the hemisphere [14]. Form-factors must be computed from every patch to every
other patch, resulting in memory and time complexities of O(N²). The very
large memory required for the storage of these form-factors limits the practical
use of the radiosity algorithm. This difficulty was addressed by the progressive
radiosity approach [13]. In the conventional radiosity approach, the system of
radiosity equations is solved using the Gauss-Seidel method. At each step the
radiosity of a single patch is updated based on the current radiosities of all the
patches: illumination from all other patches is gathered into a single receiving
patch. The progressive radiosity method can be represented by the following
schematic algorithm (N is the number of patches):


real Fi[N];        /* column of form-factors */
real ΔRad;
real B[N];         /* array of radiosities */
real ΔB[N];        /* array of delta radiosities */

/* Initialisation: delta radiosity = self-emittance */
for all patches i do
    ΔB[i] = E_i;

/* iterative resolution process */
while no convergence() do {                     /* emission loop */
    i = patch-of-max-flux();
    compute form-factors Fi[j];                 /* form-factor loop */
    for all patches j do {                      /* update loop */
        ΔRad = ρ_j ΔB[i] × Fi[j] A_i / A_j;
        ΔB[j] = ΔB[j] + ΔRad;
        B[j] = B[j] + ΔRad;
    }                                           /* end of update loop */
    ΔB[i] = 0.0;
}                                               /* end of emission loop */

At each step, the illumination due to a single patch is distributed to all other
patches within the scene. Form factors can be computed either using a hemicube
with classical projective rendering techniques or using a hemisphere with a
ray-tracing technique. In the first steps, the light source patches are chosen to
shoot their energy since the other patches will have received very little energy.
The subsequent steps select secondary sources, starting with those surfaces that
receive the most light directly from the light sources, and so on. Each step
increases the accuracy of the result that can be displayed. Useful images can
thus be produced very early in the shooting process. Note that, at each step,
only one column of the system matrix is calculated, thus avoiding the memory
storage problem.
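
The shooting process can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in any of the surveyed systems: `form_factors(i)` is a hypothetical callback returning the form-factors from patch i to every patch, the radiosities are initialised to the self-emitted values, and convergence is declared when the largest unshot flux falls below `tol`.

```python
def progressive_radiosity(E, rho, A, form_factors, tol=1e-6, max_steps=1000):
    """Progressive (shooting) radiosity.
    E: self-emitted radiosities, rho: reflectivities, A: patch areas.
    form_factors(i) returns the list Fi, where Fi[j] is the fraction of the
    energy leaving patch i that arrives at patch j."""
    n = len(E)
    B = list(E)       # radiosities (assumed initialised to self-emittance)
    dB = list(E)      # unshot (delta) radiosities
    for _ in range(max_steps):                       # emission loop
        # Select the patch with the largest unshot flux dB * A.
        i = max(range(n), key=lambda k: dB[k] * A[k])
        if dB[i] * A[i] < tol:
            break                                    # convergence reached
        Fi = form_factors(i)                         # one column only
        for j in range(n):                           # update loop
            if j == i:
                continue
            d_rad = rho[j] * dB[i] * Fi[j] * A[i] / A[j]
            dB[j] += d_rad
            B[j] += d_rad
        dB[i] = 0.0                                  # patch i has shot its energy
    return B
```

Only the form-factors of the shooting patch are held in memory at any time, and the intermediate values of `B` after each step are the partial results that can already be displayed.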


Parallelisation strategies The parallel radiosity algorithms proposed in the
literature are difficult to classify, since numerous criteria can be considered:
target architecture (SMPC, DMPC), type of parallelism (control oriented, data
oriented, systolic), level at which the parallelism is exploited, form-factor
calculation method, etc. The most important problem which arises when
parallelising radiosity is data access. Indeed, the form-factor calculation and the
radiosity update require access to the whole database. Another important
point to be accounted for is the level at which the parallelism must be exploited
in the algorithm. A coarse grain reduces the cost entailed by managing
the parallelism, but generates less parallelism than a fine grain. Moreover, data
locality has to be exploited to keep the amount of communication between
processors as low as possible. In the radiosity algorithm, choosing coarse grain
parallelism would not allow the exploitation of data locality. Therefore, a
trade-off between selecting a grain size and exploiting data locality has to be
found. Several parallelisation strategies have been studied for both SMPCs and
DMPCs. Figure 1 gives references for some of these studies. Most of them focus
on the parallelisation of the progressive radiosity approach. In this approach,
several levels of parallelism can be exploited. These levels correspond to the
three nested loops of the progressive radiosity algorithm given above.

The first level consists in letting several patches emit (or shoot) their
energy in parallel. Each processor is in charge of an emitter patch, and computes
the form-factors between this patch and the others. Three cases can thus be
considered. The first case is when all the processors are able to access all the
patches' radiosities. In this case, each processor shoots the energy of its emitter
patch and selects the next emitter patch. The second case occurs when each
processor manages only a subset of the radiosities. The form-factors computed by
a processor are then sent to the other processors which update, in parallel,
their own radiosities. The last case is when only one processor manages all the
radiosities. The other ones take care of the calculation of a vector of form-factors
(a column of the linear system matrix) and send this vector to the master processor,
which updates all the radiosities and selects the next emitter patches. Note that
such a parallelisation changes the semantics of the sequential algorithm since the
emission order of patches varies with the number of processors available in the
parallel machine. In addition, the selection of the emitter patches requires
access to all the patches' radiosities (a global view of the radiosities). Without
such a view, each processor selects the emitter patch among those it is responsible
for, which slows down the convergence of the progressive radiosity algorithm.

The  second  level  of  parallelism  consists  in  the  computation  of  the  form-factors 
between  a  given  patch  and  the  other  ones  by  several  processors. 

The  third  level  of  parallelism  corresponds  to  the  computation  of  the  delta 
form-factors  in  parallel.  This  level  is  strongly  related  to  the  classical  projective 
techniques  and  z-buffering  used  for  form-factor  calculation. 


Architecture  Scene        Parallelism  References
------------  -----------  -----------  ------------------------------------------
SMPC          shared       (illegible)  Baum et al. [5], Renambot et al. [35]
DMPC          shared       control      Bouatouch et al. [6]
              distributed  data         Arnaldi et al. [1], Varshney et al. [38],
                                        Drucker et al. [17], Guitton et al. [22]
              distributed  control      Chalmers et al. [10]
              duplicated   control      Chen [11], Recker et al. [33],
                                        Lepretre et al. [26]
              passed       systolic     Purgathofer et al. [32]

Fig. 1. Classification of parallel radiosity algorithms


3.3  Discussion 

As shown in the last section, parallelisation of the ray-tracing algorithm has
been widely investigated and efficient strategies are now identified. Concerning
radiosity computation, several parallelisation strategies have been studied.
However, none of them really dealt with data locality. Therefore, solving the
radiosity equation for complex scenes (1) in parallel cannot achieve good
performance. In [39, 37, 2], several ideas have been proposed to deal with complex
environments. They are mostly based on Divide and Conquer strategies: the complex
environment is subdivided into several local environments where the radiosity
computation is applied. In the following section, we present a new parallelisation
strategy able to render complex scenes using a progressive radiosity approach as
explained in section 3.2. It extends the work published in [2].


(1) Scenes that have more than one million polygons.




VECPAR'98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


4  Enhancing  data  locality  in  a  progressive  radiosity 
algorithm 

The main goal of our work was to design a data domain decomposition well
suited to the radiosity computation. Data domain decomposition aims at dividing
the data structure into sub-domains, which are associated with processors. Each
processor performs its computation on its own sub-domain and sends computations
to other processors when it does not have the required data. Applied to
the radiosity computation, our solution focuses on the ability to compute the
radiosity on local environments instead of solving the problem for the whole
environment. By splitting the problem into subproblems, using Virtual Interfaces
and Visibility Masks, our technique is able to achieve better data locality than
other standard solutions. This property is crucial both on a modern sequential
computer, to reduce data movement in the memory hierarchy, and on a
multiprocessor, to keep communication between processors as low as possible
whatever the communication paradigm is: message passing or shared memory.


4.1  Data  domain  decomposition 




Fig.  2.  Virtual  Interface. 


This section briefly summarizes the virtual interface, the basic concept used
to perform the data domain decomposition. A more detailed description can be
found in [1]. A virtual interface is a technique to split the environment (the
scene bounding box) into local environments where the radiosity computation
can be applied independently of the other local environments (figure 2). It
also addresses the energy transfer between local environments. When a source
located in a local environment has to distribute its energy to neighbouring
local environments, its geometry and its emissivity have to be sent. We attach
to the source a new structure called the visibility mask (figure 3), which
stores all the occlusions encountered during the processing in each local
environment.

With the virtual interface concept, the energy of each selected patch, called
a source, is first distributed in its own local environment and then
propagated to the other local environments. To propagate the energy of a given
patch to another local environment efficiently, it is necessary to determine
the visibility of the patch from the current local environment: an object
along the given direction may hide the source (figure 3). The visibility mask
is a sub-sampled hemisphere identical to the one involved in the computation
of the form factors: to each pixel of the hemisphere used for form factor
computation corresponds a boolean value in the visibility mask. The mask
allows energy to be distributed to the local environments step by step. If the
source belongs to the local environment, a visibility mask is created;
otherwise the mask already exists and is updated during the processing of the
source. Form factors are computed with the patches belonging to the local
environment by casting rays from the center of the source through the
hemisphere. If a ray hits an object in the local environment, the
corresponding value in the visibility mask is set to false; otherwise it is
left unchanged. Afterwards, the radiosities of the local patches are updated
during an iteration of the progressive radiosity algorithm. Finally, the
source and its visibility mask are sent to the neighbouring local environments
to be processed later.
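
The mask bookkeeping described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the authors'
implementation: the mask is a flat list of booleans (one per hemisphere
pixel), `hits_local_object` is a hypothetical predicate standing in for the
ray caster, and pixels already set to false are assumed to be skipped.

```python
def new_visibility_mask(num_pixels):
    """Mask for a source that belongs to the local environment: all visible."""
    return [True] * num_pixels


def process_source(mask, hits_local_object):
    """Cast one ray per hemisphere pixel and update the mask in place.

    `hits_local_object(pixel)` is a hypothetical callback returning True when
    the ray through `pixel` hits a patch of the local environment.
    Returns the pixels whose rays deposited energy locally.
    """
    contributing = []
    for pixel, visible in enumerate(mask):
        if not visible:
            continue  # occluded in a previously visited environment (assumption)
        if hits_local_object(pixel):
            mask[pixel] = False  # the ray stops here, nothing propagates further
            contributing.append(pixel)
    return contributing


# A local source: the mask is created, rays 1 and 3 hit local geometry.
mask = new_visibility_mask(6)
first = process_source(mask, lambda p: p in {1, 3})
# The same source processed by a neighbour: ray 3 is already blocked.
second = process_source(mask, lambda p: p in {3, 4})
```

After the second call only the pixels that were never blocked remain true, so
the mask carries the accumulated occlusion from one local environment to the
next.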


Fig.  3.  Initialisation  of  the  visibility  mask  for  a  local  source. 


4.2  Parallel  algorithm 

The data domain decomposition technique presented in the previous section is
well suited to distributed memory parallel computers. Indeed, a local
environment can be associated with a processor performing the local radiosity
computation. Energy transfer between local environments is performed using
message passing. The parallel algorithm running on each processor consists of
three successive steps: an initialisation, a radiosity computation and a
termination detection, as shown in figure 4. These three steps are described
in the following paragraphs.


void ComputeNode()
{
    Initialisation();
    do {
        do {  /* Computing */
            Choose source among region and network queue;
            Shoot the source;
            Read all the sources arrived from network, put them in queue;
        } while (!local_convergence);
        TerminationDetection(done);
    } while (!done);
}


Fig.  4.  Algorithm  running  on  a  processor. 


Each processor performs an initialisation step, which consists in reading the
geometries of the local environments that have been assigned to it. Once a
processor has the description of its local environment geometries, it reads
the scene database to extract the polygons that belong to its local
environments. The last initialisation step subdivides the local environment
into cells using a regular spatial subdivision [18] in order to speed up the
ray-tracing process when form factors are computed.

Once the initialisation has been performed, each processor selects the emitter
patch with the greatest delta-radiosity, either from its local environment or
from a receive queue that contains the patches and visibility masks sent by
other processors since the last patch selection. Each selected patch is
associated with a visibility mask that represents the distribution of energy.
Initially, when a processor selects a patch that belongs to its local
environment, all pixels of the visibility mask are set to true. As the
form-factor calculation progresses, using a ray-tracing technique, pixels of
the visibility mask are set to false when the corresponding rays hit an object
in the local environment. Once the energy has been shot in the local
environment, it is necessary to determine whether there is still some energy
to be sent to the neighbouring local environments. A copy of the visibility
mask is made for each neighbouring local environment. Rays are then traced for
all pixels that have a true value and intersected with the plane that
separates the two local environments, to determine whether energy enters the
neighbouring local environment. If a ray hits the separating plane, the
corresponding pixel of the visibility mask is left unchanged; otherwise it is
set to false. At the end of this process, each copy of the visibility mask
that still has some pixels set to true is sent to the processor that owns the
corresponding neighbouring local environment, together with information about
the geometry and the photometry of the emitter patch.
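
The forwarding step just described can be sketched as follows. The code is a
minimal illustration, assuming the mask is a list of booleans and using a
hypothetical predicate `crosses_plane(pixel, neighbour)` in place of the real
ray/plane intersection test; neighbour names are placeholders.

```python
def masks_to_forward(mask, neighbours, crosses_plane):
    """Return {neighbour: mask copy} for neighbours that still receive energy.

    `crosses_plane(pixel, neighbour)` is a hypothetical predicate: True when
    the ray through `pixel` hits the plane separating us from `neighbour`.
    """
    outgoing = {}
    for neighbour in neighbours:
        copy = list(mask)  # one private copy per neighbouring environment
        for pixel, visible in enumerate(copy):
            if visible and not crosses_plane(pixel, neighbour):
                copy[pixel] = False  # this ray never enters that neighbour
        if any(copy):  # send only if some energy remains
            outgoing[neighbour] = copy
    return outgoing


mask = [True, False, True, True]          # state after local shooting
planes = {"east": {0, 1}, "north": {1}}   # pixels whose rays reach each plane
out = masks_to_forward(mask, ["east", "north"],
                       lambda p, n: p in planes[n])
```

In this example only the eastern neighbour receives a message: the single ray
that reaches the northern plane was already blocked locally.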

Each processor performs its computation independently of the other processors.
Therefore, it has no knowledge about the termination of the other compute
nodes. Termination detection is carried out in two steps. The first step
consists in deciding to stop the selection of local patches once their energy
falls under a given threshold. The service node is in charge of collecting, on
a regular basis, the sum of the delta radiosities of the local environments as
well as the energy that is in transit between compute nodes. Depending on this
information, the service node can tell the compute nodes to stop selecting new
local patches. Once this first step is reached, the buffers that contain
sources coming from other compute nodes still have to be processed. To detect
that all the buffers are empty, the second step uses a distributed algorithm
based on the circulation of a token [15]. Compute nodes are organized as a
ring. A token, whose value is initially true, is sent through the ring. If a
compute node has sent a patch to another node in the ring since the last visit
of the token, it changes the value of the token to false. The token is then
passed to the next compute node. If the compute node that initiated the
circulation later receives the token with a true value, it broadcasts a
message to inform all the compute nodes that the radiosity computation has
ended.
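
The token circulation can be simulated in a few lines. The sketch below is a
simplified model under explicit assumptions: nodes are indexed 0..n-1 around
the ring, each node keeps a flag recording whether it has sent a patch since
the token last visited it, and no new patches are sent while the token
travels. Function names are illustrative.

```python
def circulate_token(sent_since_last_visit, initiator=0):
    """One trip of the token around the ring; returns the token's final value.

    `sent_since_last_visit[i]` is True if node i sent a patch to another node
    since the token last visited it. Each flag is cleared as the token passes.
    """
    n = len(sent_since_last_visit)
    token = True
    for step in range(n):
        node = (initiator + step) % n
        if sent_since_last_visit[node]:
            token = False
            sent_since_last_visit[node] = False
    return token


def rounds_until_termination(sent_since_last_visit, max_rounds=10):
    """Repeat token trips until one comes back True (every node was quiet)."""
    for round_number in range(1, max_rounds + 1):
        if circulate_token(sent_since_last_visit):
            return round_number
    return None


flags = [False, True, False, False]   # node 1 sent a patch recently
rounds = rounds_until_termination(flags)
```

With one recently active node, the first trip returns false and a second trip
is needed before the initiator can broadcast termination.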

4.3  Results 

Experiments were performed on a 56-processor Intel Paragon XP/S using three
different scenes. The Office scene represents an office with tables, chairs
and shelves. The scene contains two lights on the ceiling. It is an open scene
with few occlusions. It is made of roughly six hundred polygons. After
meshing, 7440 patches were obtained. The second scene, Rooms, is a set of 32
similar rooms. Each room is composed of four tables, four doors that open onto
the next rooms, and one light source. This is a symmetrical scene with many
occlusions. It comes from the benchmark scenes presented at the 5th
Eurographics Workshop on Rendering. After meshing, 17280 patches were
obtained. The last scene, Building, is the biggest one. It represents five
floors of a building without any furniture and with one thousand light
sources. This scene represents the Soda Hall Berkeley Computer Science
building [19]. After meshing, 71545 patches were generated.

Since the computing time of the sequential version depends on the number of
local environments, we decomposed the scene into 56 local environments in
order to avoid super-linear speedup. As said previously, decomposition is a
straightforward automatic process without optimisation to balance the load
among the processors. Although no effort was spent on the load balancing
problem, speedups were quite good compared with other parallelisation
strategies previously published. Using 56 processors, we obtained a speedup of
11 for the Office scene, 40 for the Rooms scene and 24 for the Building scene.
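
To put these speedups in perspective, the corresponding parallel efficiencies
(speedup divided by the number of processors) can be computed directly from
the figures quoted above. The short script below is only a worked calculation;
efficiency is not reported in the text itself.

```python
processors = 56
speedups = {"Office": 11, "Rooms": 40, "Building": 24}

# Parallel efficiency = speedup / number of processors.
efficiency = {scene: s / processors for scene, s in speedups.items()}

for scene, e in efficiency.items():
    print(f"{scene}: speedup {speedups[scene]}, efficiency {e:.0%}")
```

This gives roughly 20% for Office, 71% for Rooms and 43% for Building.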

Unfortunately, the Paragon XP/S we used for our first experiments did not have
enough memory to handle complex scenes. Moreover, due to the lack of hardware
monitoring available within the processor, it was impossible to study the
impact of our technique on the memory hierarchy. We therefore performed
several experiments on a 32-processor Silicon Graphics Origin 2000 with 4
Gbytes of physical memory. A complete description of this work has been
published in [35]. Although this machine provides a global address space, we
did not use it to access data directly; it was used only to emulate a message
passing mechanism to exchange data between processors. Therefore, the parallel
algorithm is quite similar to the one described in the previous section. We
used two new scenes that have a larger number of polygons. The first, named
Csb, represents the Soda Hall Building. The five floors are made of many
furnished rooms, resulting in a scene of over 400,000 polygons. It is an
occluded scene. The second scene, named Parking, represents an underground car
park with accurate car models. The scene has over 1,000,000 polygons. It is a
regular and open scene. We use a straightforward decomposition algorithm that
places virtual interfaces evenly along each axis.


The first experiment studies how well our technique exploits the memory
hierarchy when using one processor. For that purpose, we designed a sequential
algorithm able to process several local environments instead of only one [35].
We subdivided the initial environment into 100 local environments for the
Parking scene and 125 local environments for the Csb scene. This yields a gain
factor on the execution time of 4.2 for the Parking scene and 5.5 for the Csb
scene. The gain comes mainly from a reduction of memory overhead: the
secondary data cache access time drops by a factor of up to 30 for the Parking
scene and 11 for the Csb scene. By reducing the working set, we enhance data
locality and make better use of the L2 cache. Data locality reduces memory
latency and allows the processor to issue more instructions per cycle, which
is a great challenge on a superscalar processor. The overall performance goes
from 10 Mflops to 28 Mflops.


The last experiment was performed using different numbers of processors. For
this experiment, we subdivided the environment into 32 and 96 local
environments. Since the number of local environments has an impact on the
sequential time, as shown in the previous paragraph, we decided to compute the
speedup using the sequential times obtained with the same number of local
environments. This protocol aims at avoiding super-linear speedup due to the
memory hierarchy. Using 32 processors and 32 local environments, we obtained a
speedup of 12 for both the Parking and the Csb scenes. By increasing the
number of local environments to 96, we increased the speedup to 21 for the
Parking scene and 14 for the Csb scene. This improvement is mainly due to a
better load balance between processors: since several local environments are
assigned to each processor, a cyclic distribution of the local environments to
the processors balances the load more evenly.
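
A cyclic distribution can be sketched as below, assuming the local
environments are numbered along some linear ordering and that "cyclic" has its
usual meaning (environment i goes to processor i mod P). The numbering scheme
is an assumption; the text does not specify it.

```python
from collections import Counter


def cyclic_distribution(num_environments, num_processors):
    """Assign environment i to processor i mod num_processors."""
    return [env % num_processors for env in range(num_environments)]


owners = cyclic_distribution(96, 32)
per_processor = Counter(owners)
```

With 96 local environments on 32 processors, every processor owns three
environments taken from different parts of the ordering, so expensive and
cheap regions of the scene tend to be mixed on each processor.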

5  Conclusion

As shown in this paper, parallelisation of rendering algorithms has been
extensively investigated for more than ten years. The proposed solutions are
complex and often depend on the underlying architecture. Even if such
solutions have proven their efficiency, parallel rendering algorithms are not
widely used in production, especially for the production of movies based on
image synthesis techniques. This can be explained as follows. A movie is a set
of image frames that can be computed in parallel. Such trivial exploitation of
parallelism is illustrated by the production of the Toy Story movie, which was
made entirely with computer generated images. The movie was produced using a
network of dozens of workstations. Each of them was in charge of rendering one
image frame, which took an average of 1.23 hours. Therefore, one has to raise
the following question: why should we still contribute new parallelisation
techniques for rendering algorithms? Three answers can be given to this
awkward question. The first one is obvious: there is still a need to produce a
single image in the shortest possible time (for lighting simulation
applications). The second answer comes from the fact that both radiosity and
ray-tracing techniques can be applied to other kinds of applications (wave
propagation, sound simulation, ...). Such applications follow an iterative
process that consists in analysing the results of a simulation before starting
another simulation with different parameters. In such cases, parallelisation
of ray-tracing and radiosity algorithms is required to speed up the whole
simulation process. The last answer comes from the resource management
problem. Image frame parallelism is the most efficient technique if the
geometric (objects) and photometric (textures) databases fit in memory. If
such databases exceed the size of the physical memory, frequent accesses to
the disk will slow down the computation. By pooling the resources (both
processors and memory) of several computers, a single image can be computed
more efficiently. The challenge for the next decade will be, no doubt, the
design of new parallel rendering algorithms capable of rendering large complex
scenes having several millions of objects. Parallel computers should be seen
not only as a means to reduce computing times but also as a means to solve
larger problems that a sequential computer cannot handle.

Abstract


We propose a numerical model of snow particles transported by
wind and deposited around an obstacle. We adopt a lattice Boltzmann
approach to model the fluid (wind) with a BGK subgrid technique
and we add solid particles (snow) moving on the same lattice.
This problem shows how lattice methods (where only the essential
microscopic rules of a phenomenon are extracted) can solve complex
situations where classical numerical models (CFD codes with moving
boundaries, erosion/deposition, ...) may be inadequate. Moreover,
parallel computers are naturally suited to this class of problems and
their architecture is a powerful investigation tool. Our 2D model has
been tested against a wide range of field experiments.

1  Introduction

Snow transport by wind strongly influences many human activities. Examples
are a pass road buried under a snowdrift, or the creation of wind slabs and
cornices which dramatically increase the avalanche danger above a road or a
ski trail. To face this problem, one can adopt an active strategy: building
obstacles influences the deposition pattern and, when located windward of a
road, an obstacle may store snow and reduce the drift. Since the placing and
shape of such obstacles are not trivial, the field expert could be helped by
results from a numerical model.

Predicting how snow deposits and gets eroded under the action of wind requires,
in principle, solving the turbulent Navier-Stokes equations with particles in
suspension and dynamically changing boundary conditions. This is hardly
tractable. We propose a new numerical approach to simulate snow deposition,
free of the above complications and based on the lattice Boltzmann method. We
model the phenomena at a “microscopic” level of description and consider very
intuitive basic mechanisms. Our approach is tested on a wide range of
situations and provides a unified view of the various processes involved. The
same model could also be extended to simulate sand dune formation or
sedimentation problems.

Snow transport by wind is still a domain where little understanding has been
achieved. Different deposition patterns occur at different scales, from very
small ripples (a few centimeters) up to cornices 15 meters high in high
mountains. These phenomena are a demanding test for a numerical model.

Phenomenologically, snow transport (i.e. erosion and deposition) has been
divided into three main processes, each corresponding to a different scale:

-  Creeping: particles are “rolling” on the surface or making very little jumps.

-  Saltation: in the first half meter above the surface, snow particles have been
observed to be ejected vertically and follow a ballistic trajectory [9].

-  Suspension: it accounts for transport over larger scales (often seen as white
smoke near mountain crests).

Dealing with microscopic interactions, the cellular automata approach proposes
a unified view of such complex phenomena. We present here the mechanisms
of our model and compare its results with field experiments.


2  Modeling  reality 

To predict snow transport and model the snow erosion/deposition areas, both
wind tunnel experiments (with fairly good accuracy over a mountain pass area
[1]) and numerical computations have been investigated. The numerical
approaches are of two kinds: statistical methods based on comparisons with
recorded data [17] (they can be accurate in geometrically simple situations),
and direct techniques based on standard computational fluid dynamics (CFD)
[16].

The latter approach is technically quite difficult: to compute the wind
pattern, one has to solve the turbulent Navier-Stokes equations with
dynamically changing boundary conditions (to account for the evolution of the
deposition layer). Creeping, saltation and suspension of snow are included
separately [12] through dedicated equations and, most of the time, due to the
complexity of the model, one or two of these processes are neglected.

In this report we propose a new and radically different numerical approach
to simulating snow deposition, free of the above complications and based on
cellular automata and the lattice Boltzmann method. Instead of solving a
differential equation, we propose a model of the phenomenon. We consider a
“microscopic” description, involving snow and wind “particles” and the
essential basic interactions between them.

The cellular automaton approach exploits the fact that several levels of
reality exist in physics [8]. On the one hand, there is the macroscopic level
where phenomena are expressed in terms of rather abstract mathematical objects
such as differential equations. On the other hand, there is the microscopic
level of description where the interactions between the basic constituents are
considered. An important result of statistical mechanics is that the
macroscopic level of description depends very little on the details of the
microscopic interactions. Rather, it depends on conservation laws and
symmetries (most fluids obey the same equations of motion, though the
molecular interactions differ). One can use this property to build a
fictitious universe in which the microscopic interactions are particularly
simple to simulate on a computer and whose macroscopic behavior is identical
to that of the real system [6]. Moreover, since the evolution rule is local
and identical for every lattice cell, massively parallel computers are
naturally suited to running such models.


3  The  cellular  automata  approach 

The first major step in this direction was the so-called FHP lattice gas model,
a two-dimensional cellular automaton fluid proposed by Frisch, Hasslacher and
Pomeau [5]. As shown in figure 1, the fluid is modeled as a large population of
discrete particles moving synchronously, in discrete time steps, along the
links of a regular lattice and changing their direction when bouncing into
each other. Cellular automata fluids are typically described by occupation
numbers $n_i(\mathbf{r}, t) \in \{0, 1\}$ indicating the absence or presence of
a particle entering site $\mathbf{r}$ at time $t$, with a unit velocity
$\mathbf{v}_i$ pointing along lattice direction $i$ (for instance, directions
are labeled counterclockwise, and 0 designates East). The equation of motion
reads

$$n_i(\mathbf{r} + \tau \mathbf{v}_i, t + \tau) = n_i(\mathbf{r}, t) + \Omega_i(n(\mathbf{r}, t)) \qquad (1)$$

where $\tau$ is the time step and $\Omega_i$ is the collision term, defined as
a nonlinear combination of the $n_i$ and expressing the balance of particles
in direction $i$ after the collision takes place. The collision rule is
tailored so that mass (defined as $\sum_i n_i$) and momentum
($\sum_i n_i \mathbf{v}_i$) are locally conserved by the interaction.
Some lattice sites can be turned into solid (representing either deposited
snow or ground). Incoming particles on solid sites bounce back, thus modeling
a no-slip boundary condition.

Fig. 1. Evolution of fluid particles on a hexagonal lattice. The first image
shows incoming particles at time $t$ on each site, the second one the
distribution after the collision step, and the third one their new positions
at time $t + \tau$. The lattice displayed here is hexagonal (as in the FHP
model), but the one used for our simulations is square, where each cell has
eight neighbors (E, NE, N, NW, ...).
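
The direction labelling and the bounce-back rule can be illustrated with a
short sketch. It assumes the square lattice with eight directions labelled
counterclockwise from East, as stated in the text, and represents a site by a
list of eight occupation numbers; the function names are illustrative.

```python
# Eight lattice directions, labelled counterclockwise, 0 = East.
V = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def mass(n):
    """Number of particles on the site."""
    return sum(n)


def momentum(n):
    """Total momentum of the particles on the site (lattice units)."""
    return (sum(ni * vx for ni, (vx, _) in zip(n, V)),
            sum(ni * vy for ni, (_, vy) in zip(n, V)))


def bounce_back(n):
    """On a solid site every incoming particle is sent back where it came from."""
    return [n[(i + 4) % 8] for i in range(8)]


n_in = [1, 0, 1, 0, 0, 0, 0, 1]     # particles entering along E, N and SE
n_out = bounce_back(n_in)
```

Bounce-back keeps the number of particles but reverses their momentum, which
is what forces the average fluid velocity to vanish at the wall.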

These types of models, despite their artifacts and limitations, capture the
essence of fluid dynamics, in the sense that the local average density $\rho$
and velocity field $\mathbf{u}$,

$$\rho = \Big\langle \sum_i n_i \Big\rangle, \qquad \mathbf{u} = \frac{1}{\rho} \Big\langle \sum_i n_i \mathbf{v}_i \Big\rangle \qquad (2)$$

obey, within certain limits, the Navier-Stokes equation with a built-in
viscosity and pressure term. As explained in section 5, these algorithms can
also be implemented very efficiently on massively parallel computers, thus
allowing interactive modeling.

Much progress has been made in simulating three-dimensional fluids and
remedying some of the earlier deficiencies (e.g. spurious invariants,
statistical noise). A key improvement was to simulate the probability of
presence of a particle, $f_i = \langle n_i \rangle$, rather than the particle
itself. This gives much more flexibility to define the collision rule. The
so-called lattice BGK models [13] have no statistical noise and, above all,
contain a free relaxation parameter $\tau_r > 1/2$ to tune the viscosity
$\nu = (1/6)(2\tau_r - 1)$ (this evolution is reviewed by Qian [14]). The
dynamics are similar to the FHP model and become

$$f_i(\mathbf{r} + \tau \mathbf{v}_i, t + \tau) - f_i(\mathbf{r}, t) = \frac{1}{\tau_r} \left( f_i^{(0)}(\mathbf{r}, t) - f_i(\mathbf{r}, t) \right) \qquad (3)$$
where $f_i^{(0)}$ is the local equilibrium distribution depending on the local
velocity field $\mathbf{u}$ and particle density $\rho$:

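$$f_i^{(0)} = a_i \rho + \frac{b_i}{v^2}\, \rho\, \mathbf{v}_i \cdot \mathbf{u} + \rho\, e_i \frac{u^2}{v^2} + \rho\, \frac{h_i}{v^4}\, v_{i\alpha} v_{i\beta} u_\alpha u_\beta \qquad (4)$$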
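
As an illustration of the BGK relaxation, the sketch below performs one
collision step on a single site. It is not the model's exact formulation: it
assumes the standard D2Q9 lattice (one rest population plus eight moving
directions), lattice units with unit speed and time step, and the standard
D2Q9 equilibrium weights, which are consistent with the viscosity
$\nu = (2\tau_r - 1)/6$ quoted above. All function names are illustrative.

```python
# Standard D2Q9 lattice: rest particle + 8 moving directions (lattice units).
V = [(0, 0), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
W = [4/9, 1/9, 1/36, 1/9, 1/36, 1/9, 1/36, 1/9, 1/36]


def moments(f):
    """Density and velocity of one site."""
    rho = sum(f)
    ux = sum(fi * vx for fi, (vx, _) in zip(f, V)) / rho
    uy = sum(fi * vy for fi, (_, vy) in zip(f, V)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Standard second-order D2Q9 equilibrium (an assumption, see lead-in)."""
    usq = ux * ux + uy * uy
    feq = []
    for (vx, vy), w in zip(V, W):
        vu = vx * ux + vy * uy
        feq.append(w * rho * (1 + 3 * vu + 4.5 * vu * vu - 1.5 * usq))
    return feq


def bgk_collide(f, tau_r):
    """Relax every population toward equilibrium with relaxation time tau_r."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fi + (fe - fi) / tau_r for fi, fe in zip(f, feq)]


def viscosity(tau_r):
    return (2 * tau_r - 1) / 6


f = [0.40, 0.15, 0.02, 0.10, 0.03, 0.09, 0.02, 0.11, 0.03]
f_post = bgk_collide(f, tau_r=0.8)
```

The collision changes the individual populations but leaves the density and
velocity of the site unchanged, which is the discrete counterpart of mass and
momentum conservation.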

To lower the viscosity of the fluid, and therefore increase the Reynolds
number of the flow, one can either decrease $\tau_r$ toward 1/2 or increase
the resolution of the lattice (at the cost of CPU time and memory). The first
technique leads to numerical instabilities and the second one is limited by
the power of available computers. To cope with such problems, a subgrid
approach can be applied [7], following the ideas of large eddy simulations
(LES): it takes into account eddies at unresolved scales (as the lattice
spacing is finite) through their influence on larger ones. This influence can
be included by adjusting the relaxation time $\tau_r$ dynamically and locally
according to the magnitude of the local strain-rate tensor

$$\tau_r = \tau_0 + C\, |\partial_\beta u_\alpha + \partial_\alpha u_\beta| \qquad (5)$$

where $C$ is a model constant and $\tau_0$ the equilibrium relaxation time.
The indices $\alpha$ and $\beta$ label spatial coordinates and summation over
repeated indices is assumed. Having such an effective relaxation time (and
thus an effective viscosity) allows one to simulate high Reynolds number flows
($\sim 10^6$) by tuning $C$ [7]. The effect of $C$ is to adjust the resolution
scale of the flow. A small $C$ is appropriate to describe small developed
eddies. Thus, in order to produce the correct logarithmic velocity profile
[19], which requires a finer resolution near the ground, $C$ is considered as
a function of the height [11]. It decreases linearly from a value $C_\infty$
to 0 as one approaches the ground level.
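
The local adjustment of the relaxation time can be sketched as follows. The
code is a hedged illustration only: the height `z_top` above which $C$ equals
$C_\infty$, the use of the Frobenius norm as the "magnitude" of the strain-rate
tensor, and all function names are assumptions that the text does not specify.

```python
def c_of_height(z, z_top, c_inf):
    """Hypothetical linear ramp: 0 at the ground (z = 0), c_inf for z >= z_top."""
    return c_inf * min(max(z / z_top, 0.0), 1.0)


def strain_magnitude(dux_dx, dux_dy, duy_dx, duy_dy):
    """Magnitude of S_ab = d_b u_a + d_a u_b (Frobenius norm, an assumption)."""
    sxx = 2 * dux_dx
    syy = 2 * duy_dy
    sxy = dux_dy + duy_dx
    return (sxx * sxx + syy * syy + 2 * sxy * sxy) ** 0.5


def effective_tau(tau_0, c, strain):
    """Relaxation time of equation (5): tau_r = tau_0 + C |S|."""
    return tau_0 + c * strain


# A simple shear layer: u_x grows with height, nothing else varies.
strain = strain_magnitude(0.0, 0.02, 0.0, 0.0)
tau_ground = effective_tau(0.51, c_of_height(0.0, 10.0, 0.3), strain)
tau_high = effective_tau(0.51, c_of_height(20.0, 10.0, 0.3), strain)
```

At ground level the correction vanishes and the fluid keeps its bare
relaxation time, while far from the ground the same shear produces a larger
effective viscosity.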

The present model is far more elaborate than the original FHP but it keeps
its main advantages: one deals directly with fluid “particles,” the algorithm
is simple, local and runs well on massively parallel computers.

4  Adding  snow  particles 

Virtual snow particles can be added on top of such a lattice BGK model for
wind with simple erosion and deposition mechanisms. Snow particles can either
be injected into the simulation (snow fall) or eroded from the ground; they
are then transported and deposited according to the combined effect of gravity
and wind direction.

It would be far too ambitious (and against our working hypotheses) to include
all the complications of the erosion/transport phenomenon (exact number of
snow particles, their geometrical structures, the correct inter-particle
cohesion, the evolution of the chemical properties and the temperature, etc.).
Following the philosophy of statistical mechanics and the cellular automata
approach, we restrict ourselves to what we identify as the most significant
processes and express them in simple rules. The macroscopic behavior will be
captured provided that these ingredients are indeed the relevant ones and that
enough scale separation exists between the different levels of description
[2]. For the time being, no mechanism is included to compute the strain inside
the snow deposit and, therefore, no realistic cornices can be modeled. The
rules we consider are:

4.1  Transport 

An arbitrary number of snow grains may reside at each lattice site. During the
updating step, flakes are synchronously spread to the nearest neighboring sites
following the wind velocity field $\mathbf{u}$ and the gravity (through the
falling speed $\mathbf{u}_{fall}$). As a solid particle moving from
$\mathbf{r}$ to $\mathbf{r} + \tau(\mathbf{u} + \mathbf{u}_{fall})$ would not
remain on the (square) lattice, we define a randomized algorithm where only
the averaged trajectories are exact:


r  -I-  Tti3  r  -I-  t,t'2 


In the case above, the grain would move eastwards with probability
$p_{east} = \min(1, u_x/|v_0|)$ and northwards with
$p_{north} = \min(1, u_y/|v_2|)$. Finally, a particle will jump to neighbor 1
(i.e. $r + \tau v_1$) with probability $p_1 = p_{east}\, p_{north}$, to 0 with
$p_0 = p_{east}(1 - p_{north})$ and to 2 with $p_2 = p_{north}(1 - p_{east})$; it
will therefore remain on $r$ with probability
$p_{rest} = (1 - p_{east})(1 - p_{north})$. If $N = \sum_i N_i$ is large
enough, one can approximate such a binomial scattering (among the neighbors
0, 1 and 2) with a Gaussian random distribution.
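
The jump probabilities of this randomized rule can be sketched in a few lines of Python. This is only an illustration, not the authors' CM Fortran code: the function names `jump_probabilities` and `mean_displacement` are hypothetical, the axis lattice speed is assumed to be 1, and both velocity components are assumed non-negative (the east/north case discussed above).

```python
def jump_probabilities(ux, uy, v_axis=1.0):
    """Return (p_rest, p0, p1, p2) for a grain with velocity (ux, uy) >= 0.

    Assumed numbering: neighbor 0 is east, 1 is north-east, 2 is north.
    v_axis stands for the lattice speed |v_0| = |v_2| (assumed to be 1).
    """
    p_east = min(1.0, ux / v_axis)
    p_north = min(1.0, uy / v_axis)
    p1 = p_east * p_north                  # diagonal jump
    p0 = p_east * (1.0 - p_north)          # east only
    p2 = p_north * (1.0 - p_east)          # north only
    p_rest = (1.0 - p_east) * (1.0 - p_north)
    return p_rest, p0, p1, p2


def mean_displacement(ux, uy, v_axis=1.0):
    """Average displacement per time step, in lattice units."""
    _, p0, p1, p2 = jump_probabilities(ux, uy, v_axis)
    dx = (p0 + p1) * v_axis                # neighbors 0 and 1 move east
    dy = (p1 + p2) * v_axis                # neighbors 1 and 2 move north
    return dx, dy
```

As long as each velocity component stays below the lattice speed, the mean displacement returned by `mean_displacement` equals the input velocity, which is the sense in which only the averaged trajectories are exact.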

This  algorithm  obviously  ensures  that  the  average  motion  satisfies  the  local 
wind  speed  and  gravity  (i.e.  the  falling  speed).  Moreover,  no  effort  is  made  to 
split  the  transport  into  creeping,  saltation  or  suspension  since  a  particle  is  not 
aware  of  its  elevation  above  the  snow  ground  level. 

4.2  Deposition 

Lattice  sites  can  be  either  solid  (original  landscape  or  deposited  snow)  or  free 
(air).  A  free  site  may  become  solid  if  enough  virtual  particles  land  on  it.  Snow 
particles  on  a  free  site  may  “freeze”  if  the  neighboring  site  to  which  they  want 
to  jump  is  a  solid  site.  When  the  number  of  frozen  particles  on  a  lattice  site 
exceeds some (empirically) pre-assigned threshold, it becomes solid and subsequent
incoming wind particles will bounce back. This threshold gives a way to
assign some size to the snow particle, and must be calibrated with regard to the
real size of a simulation cell. To model small phenomena, such as ripples (fig. 4),
this threshold is lower (≈ 10 particles) than for the deposit around a mountain
crest (≈ 100 particles, fig. 2). When a site solidifies, the wind particles that may
be present get trapped until erosion frees them again.

4.3  Erosion

Deposited  particles  may  be  eroded  under  certain  conditions.  The  erosion  rate  is 
not  trivial.  It  is  related  to  the  wind  speed  above  the  solid  site,  the  concentration 
of  snow  being  transported,  the  saturation  concentration,  and  the  efficiency  of  the 
transport  [18].  In  our  model,  erosion  means  that  snow  particles  are  ejected,  with 
a (low) probability Perod, toward the upper neighboring site. Then, following
the  transport  algorithm,  such  a  particle  will  fly  away  if  the  wind  velocity  field  is 
locally  strong  enough  or  fall  back  to  its  initial  location.  The  quantity  Perod  cannot 
be  easily  related  to  physical  parameters,  so  it  must  be  empirically  calibrated:  on 
the  one  hand,  if  Perod  is  too  high,  no  particle  will  remain  at  the  same  place  for  a 
long  enough  time  and  therefore  no  deposit  can  be  built,  but  on  the  other  hand, 
if  it  is  too  low,  particles  won’t  take  off  and  nothing  will  happen  before  a  long 
simulation  time.  This  erosion  rule  seems  trivial  in  view  of  the  complexity  of  the 
problem, but the generation of ripple patterns at small spatial scales shows that
the  rule  may  have  caught  a  major  point  of  the  real  erosion  phenomenon. 
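
The deposition and erosion rules of sections 4.2 and 4.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the class `Site` and the functions `freeze` and `eject` are hypothetical names, the threshold of 10 and the value 0.05 for Perod are illustrative only, and how a solid site is freed again by erosion is not modeled here.

```python
import random


class Site:
    """A free lattice site that may collect frozen particles (hypothetical)."""

    def __init__(self):
        self.solid = False
        self.frozen = 0


def freeze(site, n_particles, threshold):
    """Freeze particles on a site; it solidifies once the count exceeds threshold."""
    site.frozen += n_particles
    if site.frozen > threshold:
        site.solid = True
    return site.solid


def eject(n_deposited, p_erod, rng):
    """Number of deposited particles ejected toward the upper neighbor this step."""
    return sum(1 for _ in range(n_deposited) if rng.random() < p_erod)


rng = random.Random(0)            # fixed seed, for reproducibility only
site = Site()
freeze(site, 10, threshold=10)    # threshold reached but not exceeded: still free
freeze(site, 1, threshold=10)     # 11 frozen particles: the site solidifies
ejected = eject(100, 0.05, rng)   # each particle is ejected with probability 0.05
```

Ejected particles would then be handed to the transport rule, which decides whether they fly away or fall back.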

5  Implementation  and  Performance 

Both  wind  (equations  (2)  to  (5))  and  particle  evolution  computations  (sec¬ 
tions  4.1  to  4.3)  are  local  and  need  regular  communications  (a  cell  communicates 
only with its nearest neighbors). The intrinsic simplicity of the model leads to
much simpler and more compact code (in CM Fortran, an early mix of Fortran 77
and High Performance Fortran dedicated to SIMD machines) than classic
heavy CFD codes that solve the Navier-Stokes differential equations.

As massively parallel computers are very well suited to lattice Boltzmann
models, all our 2D snow deposit simulations have been performed on a Connection
Machine CM-200 with 8192 1-bit processors and 256 FPUs. Some 3D fluid simulations
have been run on an IBM SP2 (14 RS6000 processors, communicating
through MPI).


5.1  Implementation 

The  simulation  domain  is  mapped  on  a  square  lattice,  thus  2D  arrays  natu¬ 
rally  offer  an  efficient  data  structure.  On  every  cell,  we  need  (i)  9  floats  for  the 
wind  densities  (eight  neighbors  plus  a  rest  quantity),  (ii)  9  integers  for  the  snow 
particles  and  (iii)  9  bits  to  get  a  local  map  of  the  surrounding  solid  cells. 

The  program  execution  can  be  summarized  by: 

1.  Initialization

2.  for every lattice site

2.a.  update the local map (9 bits) of the surroundings (as
some neighboring cells may have frozen or been freed)

2.b.  compute the wind density and velocity according
to equations (2)

2.c.  compute the snow particle evolution under
transport (4.1), deposition (4.2) and erosion (4.3)

2.d.  compute a new wind particle distribution following eq. (3)

2.e.  propagate snow and wind densities according to their
direction of motion

3.  loop

To recover a realistic situation, a chronology of events (change of some
parameters at given time steps) must be pre-defined. For example, one should (i)
let the wind get established over the simulation domain, (ii) introduce snow
particles and (iii) in some simulations, stop the particle injection and wait for a
steady state deposit.
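
The program loop and the chronology of events can be sketched together. The snippet below is a hypothetical skeleton, not the authors' implementation: `simulate`, the `schedule` dictionary and the parameter name `inject_snow` are assumptions, and the callables in `steps` stand in for the sub-steps 2.a to 2.e listed above.

```python
def simulate(n_steps, schedule, steps):
    """Run n_steps iterations, changing parameters at pre-defined time steps.

    schedule maps a time step to the parameter changes applied at that step;
    steps is the ordered list of per-iteration phases (stand-ins for 2.a-2.e).
    """
    params = {}
    for t in range(n_steps):
        params.update(schedule.get(t, {}))   # chronology of events
        for step in steps:
            step(t, params)
    return params


# Illustrative chronology: let the wind settle, inject snow, then stop injecting.
history = []


def record(t, params):
    """Stand-in phase that only records whether snow is being injected."""
    history.append((t, params["inject_snow"]))


schedule = {0: {"inject_snow": False},
            2: {"inject_snow": True},
            4: {"inject_snow": False}}
simulate(6, schedule, [record])
```

Keeping the chronology in a separate table makes it easy to test many scenarios without touching the update phases themselves.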

Moreover, to model snow transport by wind, many scenarios had to
be tested, leading to the tedious exploration of a very large parameter space.
Fortunately, short computing times (below one hour) allow the use of an interactive,
user-friendly tool and the on-line tuning of the parameters in order to
investigate relevant simulations.


5.2  Performance

A  standard  performance  measure  in  CA  simulations  is  the  number  of  site  updates 
per  second:  this  number  is  roughly  8  x  10'*  sites  per  second,  on  a  256  FPU  (8192 
integer  processors)  CM-200.  A  scalability  analysis  (speed-up  and  efficiency)  is 
difficult on such a machine, as the number of processors is fixed.

A  usual  domain  size  is  256  x  64,  and  communications  between  processors  are 
negligible  compared  to  the  computation  time. 

The massive parallelism is very well suited to the fluid updating step, as
almost every cell (except the solidified ones) must be computed. As only a few
lattice sites contain solid particles, the efficiency is lower for the snow
step. This problem can be solved using an SPMD machine (with MPI or PVM),
where only the required computations are executed and the load can be explicitly
balanced among the processors.

6  Results 

Since  no  first  principle  theory  is  available  for  snow  erosion  and  deposition,  the 
only  and  most  convincing  way  to  assess  the  validity  of  our  model  is  to  compare 
its predictions with field observations which can be found in the literature. Our
simulations  have  been  tested  with  a  wide  range  of  situations  and  have  been 
shown  to  catch  some  realistic  phenomena  at  very  different  spatial  scales. 

The fluid and the snow are modeled by virtual particles. A virtual solid
particle is not meant to represent one real snow flake, nor is a fluid particle
meant to model one air molecule. The lattice automata approach relies
on the complex collective behavior emerging when many particles with simple
rules interact. Therefore, the correspondence between our "particle world" and
real life situations is not straightforward. We shall thus proceed in four steps.




VECPAR  '98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


The  first  step  is  to  assign  a  size  to  a  lattice  cell;  since  the  number  of  cells  is 
mainly  limited  by  the  available  computer  memory  and  CPU  resources  (a  typical 
lattice  is  512  x  64),  the  cell  size  is  constrained  by  the  scale  of  the  actual  problem 
(from  3  cm  for  the  trench  (fig.  3)  or  the  ripples  (fig.  4)  up  to  1.5  m  for  the  crest 
deposits  (fig.  2)).  The  second  step  is  to  get  a  correct  Reynolds  number  (mainly 
through  the  calibration  of  Coo  and  the  entry  speed  of  the  fluid  in  the  tunnel).  In 
the  third  step,  setting  the  threshold  number  of  deposited  frozen  particles  before 
solidification of the cell gives a size to the virtual snow particle (100 particles for
the crest, versus only 10 for the ripples or the trench); note that the cell size
and this threshold are linked. The fourth step consists in adjusting Perod, the
probability  of  a  particle  ejection.  The  simulation  is  completed  when  the  deposit 
reaches  either  a  steady  state  (ripples)  or  an  expected  volume  (trench  or  crest). 

The  situations  we  considered  are  detailed  in  figures  2  through  4. 

Figure  2  shows  the  snow  deposited  along  a  mountain  crest,  at  different  po¬ 
sitions,  to  compare  the  influence  of  the  landscape  profile  on  the  accumulation 
pattern. The quality of the results is rather good in the two upper cases. In the
last one, we may explain the poor result by a bad discretization of the
flat lee-slope landscape and, moreover, by an insufficient modeling of the
re-erosion of the particles leeward of the crest (where the wind is weak).

In  figure  3  we  see  the  simulation  of  the  filling  of  a  trench  excavated  in  a 
large  flat  area.  A  qualitatively  good  agreement  is  observed  between  the  model 
and  reality,  mainly  for  the  first  part  of  the  experiment  (growth  of  two  deposition 
peaks), before the wind slowed down in the outdoor experiment. Note that
some of our other numerical simulations show an accumulation in the right-hand
corner.

Finally,  figure  4  shows  small  scale  patterns  known  as  ripples  occurring  with 
both sand and snow transport. Ripples are mainly due to creeping transport. The
ratio we find between the spacing and the height of the oscillations (called the
wave index) is around 6 (the mean ripple height is 8 sites, and the distance
between two crests around 50 sites, i.e. 50/8 ≈ 6); this value agrees with the
lowest index found for sand [15] in field observations, and fits well with wind
tunnel experiments [10] and sand ripples in water [15]. Outdoor snow ripples are
more complicated since
freezing  and  cohesion  have  to  be  taken  into  account;  their  wave  index  has  been 
measured  to  be  around  16  [1,3].  In  agreement  with  real  observations,  we  also  see 
in  our  simulation  that  ripples  move  horizontally.  This  effect  is  illustrated  in  the 
figure.  As  observed  in  [20],  our  model  also  shows  that  large  ripples  can  be  built 
through  the  merging  of  smaller  ones  traveling  faster. 

In conclusion, our model not only produces some quantitatively realistic
deposits, it also shows that even the simple and intuitive rules we have used capture
the basic mechanisms that occur in snow transport. It shows that the various
patterns of deposition result from the emergence of a collective effect rather
than from mechanisms that have not yet been identified. Creeping, saltation or
suspension are no longer three different phenomena, each requiring special
treatment; they are all captured by the same erosion/transport mechanisms. Thus,
our model results in a unified view of the basic laws governing the formation of
snow deposition patterns.

[Figure 2, panel "Nr. 1"; legend: landscape, field observed deposit, computed results]

Fig. 2.  Snow deposit over a mountain crest (Schwarzhorngrat near Davos), where the
wind blows from left to right (at different positions along the crest). The upper part
of each figure shows the ground profile and the field deposit (scale in meters) by [4]
(the crests are labeled from the reference). The lower part shows only the field and
the modeled deposits, stretched by a factor 2 in the z-direction. The same set of
parameters has been used for the three experiments.

Fig. 3.  A trench (0.7 x 1.7 m, lattice spacing 0.03 m) is getting buried under snowdrift.
Field observations have been made by Kobayashi [9] (results are shown in the figure
inset).

Fig. 4.  Formation of ripples, as obtained from our model. Particles are continuously
injected in the lower left corner of the simulation and the ripples grow spontaneously.
The deposition profile is given every 1000 time steps, which makes the horizontal
ripple motion quite clear (as well as the higher speed of the smaller ripples "escaping"
rightwards).

Note that the same approach can be extended to simulate sand dune formation
or sedimentation problems. To model dry sand deposits, one could neglect
cohesion and allow a particle to deposit only if it occupies a stable position with
respect to the solid sites underneath.


Abstract.  This paper deals with the basic principles of the new FEM
software  package  FEAST.  For  the  FEAST  software,  which  is  mainly  de¬ 
signed  for  high-performance  simulations,  we  explain  the  basic  principles 
of  the  underlying  numerical,  algorithmic  and  implementation  concepts. 
Computational examples illustrate the (expected) numerical and computational
efficiency of this new software package, particularly in relation
to  existing  approaches. 


1  Introduction 

Current  trends  in  the  software  development  for  Partial  Differential  Equations 
(PDE’s),  and  here  in  particular  for  Finite  Element  (FEM)  approaches,  go  clearly 
towards object-oriented techniques and adaptive methods in any sense. In doing so,
the employed data and solver structures, and especially the matrix structures,
are often at odds with modern hardware platforms. As a result, the observed
computational efficiency is far from the expected peak rates of almost
1 GFLOP/s nowadays, and the "real life" gap will increase even further.
Since high performance may only be reached by explicitly exploiting
"caching in" and "pipelining" in combination with sequentially stored arrays
(using machine-optimized linear algebra libraries like the ESSL or PERFLIB, for
instance), the corresponding realization seems to be "easier" for simple Finite
Difference approaches. So the question arises how to apply similar techniques
to much more sophisticated Finite Element codes.

These discrepancies between complex mathematical approaches and highly
structured computational demands often lead to unreasonable calculation times
for "real world" problems, e.g. Computational Fluid Dynamics (CFD) calculations
in 3D, as can be seen from recent benchmarks [6] for commercial as well
as research codes. Hence, strategies for efficiency enhancement are necessary,
not only from the mathematical (algorithms, discretizations) but also from the
software point of view. To realize some of these necessary improvements, our new
Finite Element package (project name: FEAST - Finite Element Analysis &
Solution Tools) is under development. This package is based on the following
concepts:

-  (recursive) "Divide and Conquer" strategies,

-  hierarchical data, solver and matrix structures,

-  ScaRC as generalization of multigrid and domain decomposition techniques,

-  frequent use of machine-optimized linear algebra routines,

-  all typical Finite Element facilities included.

The result is going to be a flexible software package with special emphasis on

-  (closer to) peak performance on modern processors,

-  typical multigrid behaviour w.r.t. efficiency and robustness,

-  parallelization tools directly included on low level,

-  openness to different adaptivity concepts,

-  low storage requirements,

-  applicability to many "real life" problems.

Figure 1 shows the general structure of the FEAST package:


Fig. 1:  FEAST structure and configuration

As programming language, Fortran (77 and 90) is used. The explicit use of
the two Fortran dialects arises from the following observations. For Fortran77, very
efficient and well-tried compilers are available which make it possible to exploit
much of the machine performance. Further, it is possible to reuse many reliable
parts of the predecessor packages FEAT2D, FEAT3D and FEATFLOW [8]. On the other
hand, Fortran77 is no more than a better "macro assembler"; its very limited
language constructs make the project work very hard. Further, F77 is no longer
the current standard, so what about support in the future? And which
developer can be motivated to program in F77?

F90, on the other hand, is the new standard and provides helpful new features
like records, dynamic memory allocation, etc. But there are several disadvantages.
The language is heavily overloaded and the realization of some features, like
pointers, has not succeeded. The weightiest point of criticism is the currently very
inefficient code generation of the F90 compilers. Programs compiled with F77
and F90 show a difference in runtime of up to a factor of 8, depending on algorithm
and compiler.

The compromise is to implement the time-critical routines from numerical
linear algebra in F77, while the administrative routines, like iteration control, are
based on F90. If the F90 compilers achieve the same code quality as their F77
counterparts, it will be no problem to switch completely to F90.

The pre- and postprocessing is mainly handled by Java-based program parts.
With a high performance computer configured as a FEAST server, the user shall
be able to perform the remote calculation from a FEAST client. Further, certain
AVS/Express modules, which AVS has agreed to let us include in our software
package for free, are provided for visualization.

In the following we give examples of "real" computational efficiency results
for typical numerical tools, which help to motivate our hierarchical data, solver
and matrix structures. To understand these better, we briefly illustrate the
corresponding solution technique ScaRC (Scalable Recursive Clustering) in
combination with the overall "Divide and Conquer" philosophy which is essential for
FEAST. We discuss how typical multigrid rates can be achieved on parallel as
well as sequential computers with a very high computational efficiency.

2  Main  Principles  in  FEAST 

2.1  Hierarchical  data,  solver  and  matrix  structures 

One of the most important principles in FEAST is to apply consistently a
(Recursive) Divide and Conquer strategy. The solution of the complete "global"
problem is recursively split into smaller "independent" subproblems on "patches"
as part of the complete set of unknowns. The two major aims in this splitting
procedure, which can be performed by hand or via self-adaptive strategies, are:

-  Find locally structured parts.

-  Find locally anisotropic parts.

Based on "small" structured subdomains on the lowest level (in fact, even
one single or a small number of elements is allowed), the "higher-level"
substructures are generated via clustering of "lower-level" parts such that algebraic
or geometric irregularities are hidden inside the new "higher-level" patch. More
background for this strategy is given in the following sections which describe the
corresponding solvers related to each stage.

Figures 2 and 3 illustrate, by way of example, the employed data structure for a
(coarse) triangulation of a given domain and its recursive partitioning into several
kinds of substructures.

According  to  this  decomposition,  a  corresponding  data  tree  -  the  skeleton  of 
the  partitioning  strategy  -  describes  the  hierarchical  decomposition  process.  It 
consists  of  a  specific  collection  of  elements,  macros  (Mxxx),  matrix  blocks  (MB), 
parallel  blocks  (PB),  subdomain  blocks  (SB),  etc. 

Fig. 2:  FEAST domain structure

The atomic units in our decomposition are the "macros" which may be of
type structured (an n x n collection of quadrilaterals (in 2D) with local Finite
Difference data structures) or unstructured (any collection of elements, for
instance in the case of fully adaptive local grid refinement). These "macros" (one
or several) can be clustered to build a "matrix block" which contains the "local
matrix parts": only here is the complete matrix information stored. Higher-level
constructs are "parallel blocks" (for the parallel distribution and the realization
of the load balancing) and "subdomain blocks" (with special conformity
rules with respect to grid refinement and applied discretization spaces). Together
they build the complete domain, resp. the complete set of unknowns. It is
important to realize that each stage in this hierarchical tree can act as an
independent "father" in relation to its "child" substructures while it is a "child" at the
same time in another phase of the solution process (inside the ScaRC solver,
see later).
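
The hierarchical data tree described above can be pictured with a small generic tree type. The sketch below is an assumption-laden illustration and not FEAST's actual Fortran data structure: the class `Node`, the function `collect_macros` and the nesting order subdomain block, parallel block, matrix block, macro are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                       # "SB", "PB", "MB" or "macro"
    name: str
    structured: bool = True         # meaningful for macros only
    children: List["Node"] = field(default_factory=list)


def collect_macros(node):
    """Return all macros at or below the given node, depth first."""
    if node.kind == "macro":
        return [node]
    found = []
    for child in node.children:
        found.extend(collect_macros(child))
    return found


# One subdomain block with one parallel block and two matrix blocks.
tree = Node("SB", "SB1", children=[
    Node("PB", "PB1", children=[
        Node("MB", "MB1", children=[Node("macro", "M001"), Node("macro", "M002")]),
        Node("MB", "MB2", children=[Node("macro", "M003", structured=False)]),
    ]),
])
names = [m.name for m in collect_macros(tree)]
```

Every node is a "father" of its children and a "child" of its own parent, which mirrors the dual role each stage plays in the recursive solver.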

2.2  Generalized  solver  strategy  ScaRC 

In short, our long-time experience with the numerical and computational
runtime behaviour of typical multigrid (MG) and Domain Decomposition (DD)
solvers can be summarized as follows:


Some observations from standard multigrid approaches: While in fact
the numerical convergence behaviour of (optimized) multigrid is very satisfying
with respect to robustness and efficiency requirements, there still remain some
"open" problems: often the parallelization of powerful recursive smoothers (like
SOR or ILU) leads to performance degradations since they can be realized only in
a "blockwise" sense. Thus it is often not clear how the nice numerical behaviour
in sequential codes for complicated geometric structures or local anisotropies can
be reached in parallel computations. Additionally, the communication overhead,
especially on coarser grid levels, dominates the total CPU time. Even more
important is the "computational observation" that the realized performance on
modern platforms is often far below (sometimes less than 1 % of) the expected
peak performance. Many codes often reach much less than 10 MFLOP/s, and
this on computers which are said (by the vendors) to run with up to 1 GFLOP/s
peak. The reason is simply that the single components in multigrid (smoother,
defect calculation, grid transfer) perform too little arithmetic work with respect to
each data exchange, so that the facilities of modern superscalar architectures
are poorly exploited. In contrast, we will show that in fact 30 - 70 % can be
realistic with appropriate techniques.


Some observations from standard Domain Decomposition approaches:

In contrast to standard multigrid, the parallel efficiency is much higher, at least
as long as no large overlap region between processors must be exchanged. While
overlapping DD methods do not require additional coarse grid problems (however,
the implementation in 3D for complicated domains or for complex Finite Element
spaces is a hard job), non-overlapping DD approaches require certain coarse grid
problems, as for instance the BPS preconditioner, which may lead again to several
numerical and computational problems, depending on the geometrical structure
or the discretization spaces used. However, the most important difference between
Domain Decomposition and multigrid is the (often) much worse convergence
rate of DD, although at the same time more arithmetic work is done on each
processor.

In conclusion, improvements are called for by the facts that the convergence
behaviour is often quite sensitive with respect to (local) geometric/algebraic
anisotropies (in "real life" configurations), and that the performed
arithmetic work (which allows the high performance) is often restricted by
(un)necessary data exchanges.

An additional observation, which is strongly related to the previous data
structure in combination with the specific hierarchical ScaRC solver, is illustrated
in the following figure. We show the resulting "optimal" mesh from a
numerical simulation by R. Becker/R. Rannacher for "Flow around the cylinder",
which was adaptively refined via rigorous a-posteriori error control mechanisms
specified for the required drag coefficient ([5]).


Fig. 4:  "Optimal grid" via a-posteriori error estimation

As can be seen, the adaptive grid refinement techniques are needed only locally,
near the boundaries, while mostly regular substructures (up to 90 %) can
be used in the interior of the domain. This is a quite typical result and shows
that even for (more or less) complex flow simulations (here as a prototypical
example) locally blockwise "Finite Difference" techniques can be applied; these
regions can be detected and exploited by the given hierarchical strategies.

The ScaRC approach consists of a separate multigrid scheme for every
hierarchical layer, whereby the multigrid scheme on the outermost layer (subdomain
layer) gives the final result. The smoothing step of the multigrid method is based
on the following notation:

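Smoothing on level h for $A_h x = b_h$:


-  global outer block Jacobi scheme (with averaging operator 'M')

   $$x^{l+1} = x^l - \omega_g \tilde{A}_{h,M}^{-1} (A_h x^l - b_h), \qquad \text{with} \quad \tilde{A}_{h,M}^{-1} := M \circ \sum_{i=1}^{N} \tilde{A}_{h,i}^{-1}$$

-  "solve" local problems $A_{h,i}\, y_i = def_i^l := (A_h x^l - b_h)|_{\Omega_i}$
   via an iteration with a preconditioner for $A_{h,i}$, or directly.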

The local smoothing operators can be a further multigrid scheme or any
other scheme like Jacobi, Gauß-Seidel, ADI or ILU. The choice of the method
depends on the local structure and "hardness" of the given domain. As a first
step, the user chooses the method explicitly, but in
future it is planned to create an "expert system" which makes this decision
largely automatically.
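
The global block Jacobi step can be illustrated on a toy problem. The sketch below is not FEAST code and rests on several assumptions: the matrix is the 1D model matrix tridiag(-1, 2, -1), the blocks are contiguous and non-overlapping (so the averaging operator M acts as the identity), the local problems are solved directly with the Thomas algorithm, and the names `matvec`, `solve_local` and `block_jacobi_step` are hypothetical.

```python
def matvec(x):
    """Return A x for the model matrix A = tridiag(-1, 2, -1)."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def solve_local(rhs):
    """Directly solve tridiag(-1, 2, -1) y = rhs on one block (Thomas algorithm)."""
    m = len(rhs)
    c = [0.0] * m                        # modified super-diagonal
    d = [0.0] * m                        # modified right-hand side
    c[0] = -0.5
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]           # b_i - a_i * c[i-1] with a_i = -1
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    y = [0.0] * m
    y[-1] = d[-1]
    for i in range(m - 2, -1, -1):       # back substitution
        y[i] = d[i] - c[i] * y[i + 1]
    return y


def block_jacobi_step(x, b, blocks, omega_g=1.0):
    """One global step x <- x - omega_g * sum_i A_i^{-1} (A x - b) restricted to block i."""
    ax = matvec(x)
    defect = [ax[i] - b[i] for i in range(len(x))]
    x_new = list(x)
    for lo, hi in blocks:                # block = index range [lo, hi)
        y = solve_local(defect[lo:hi])   # local problem on this block
        for k, i in enumerate(range(lo, hi)):
            x_new[i] = x[i] - omega_g * y[k]
    return x_new


x = [0.0] * 8
b = [1.0] * 8
for _ in range(200):
    x = block_jacobi_step(x, b, [(0, 4), (4, 8)])
```

The three ingredients named in the text are visible here: a matrix-vector multiplication for the defect, a preconditioning step (the local solves) and a linear combination with the damping parameter.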

There are several reasons why we explicitly use this basic iteration:

1.  This  general  form  allows  the  splitting  into  matrix-vector  multiplication,  pre¬ 
conditioning  and  linear  combination.  All  3  components  can  be  separately 
performed  with  high  performance  tools  if  available. 

2.  The explicit use of the complete defect $A_h x^l - b_h$ is advantageous for certain
techniques  for  implementing  boundary  conditions  (see  [7]). 

3.  All  components  in  standard  multigrid,  i.e.,  smoothing,  defect  calculation, 
step-length  control,  grid  transfer,  are  included  in  this  basic  iteration. 

Finally  it  should  be  explained  what  the  notation  ScaRC  stands  for: 

-  Scalable, w.r.t. the number of global ('l') and local solution steps ('k'),

-  Recursive, since it may be applied to more than 2 global/local levels,

-  Clustering, since fixed or adaptive blocking of substructures is possible.

For more information about ScaRC see [3], [4].

Numerical tests  For examining the convergence behaviour of ScaRC w.r.t. local
anisotropies, we defined two types of anisotropic M x M-topologies, M = 4, 8,
namely T4(a,b,c) and T8(a,b,c). Starting from the equidistant 4x4-topology
for the unit square, the "amount of anisotropy" can be parametrized by shifting
the inner x-coordinates a, b and c to the left side of the domain as shown in
figure 5, whereas the y-coordinates retain their old positions of 0.25, 0.5 and
0.75. The corresponding 8x8-topology is obtained by one regular refinement of
the 4x4-topology.
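
To get a feeling for how the parameters (a, b, c) control the anisotropy, the macro widths and the resulting macro aspect ratios of the 4x4-topology can be computed directly. This is a small illustrative sketch: the function names are hypothetical, and the aspect ratio is taken here per macro (longer side over shorter side), before any local refinement.

```python
def macro_widths(a, b, c):
    """Widths of the four macro columns of T4(a, b, c) on the unit square."""
    xs = [0.0, a, b, c, 1.0]
    return [xs[i + 1] - xs[i] for i in range(4)]


def max_macro_aspect_ratio(a, b, c, height=0.25):
    """Largest ratio of longer to shorter macro side; all macro rows have height 0.25."""
    ratios = []
    for width in macro_widths(a, b, c):
        ratios.append(max(width, height) / min(width, height))
    return max(ratios)


equidistant = max_macro_aspect_ratio(0.25, 0.5, 0.75)
moderate = max_macro_aspect_ratio(0.05, 0.2, 0.5)
strong = max_macro_aspect_ratio(0.001, 0.01, 0.1)
```

For the two anisotropic settings used in the tests, the narrowest macro column has width 0.05 and 0.001 respectively, so the largest macro aspect ratio grows from 1 in the equidistant case to about 5 and about 250.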


Fig. 5:  Parametrizable anisotropic 4x4- and 8x8-topologies T4(a,b,c) and T8(a,b,c)

Further,  we  added  a  local  refinement  procedure  which  provides  an  additional 
local  anisotropic  refinement  of  the  single  macros  corresponding  to  some  given 
input parameters (r1, r2, r3), which indicate the amount of local distortion. This
procedure  is  designed  in  a  special  way,  such  that  it  allows  the  use  of  the  standard 
multigrid  transfer  operators. 


Fig.  6:  Anisotropic  refinement  of  a  single  macro 


Figure 6 shows the refinement of the left macros in figure 5, where the single
elements are shifted to the left. By means of both these procedures, arbitrarily
small step sizes $h_{min}$, i.e., large aspect ratios AR, are obtained.

For different smoothing techniques (global Jacobi, blockwise SOR, ScaRC)
and various parameter settings of (a, b, c), Table 1 shows the required number
of  multigrid  iterations  #  for  a  relative  accuracy  of  10“®  and  corresponding  con¬ 
vergence  rates  p.  The  term  Ng  denotes  the  global  number  of  elements  in  one 
space direction, distributed equally over the single macros. For all smoothers we
performed l = 1, 2, 4 (global) smoothing steps and at most 999 multigrid iterations.
In the case of ScaRC the local problems have been solved "exactly"
with PCG methods.

Table 1:  Equidistant refinement on different anisotropic M x M-topologies

Obviously the global Jacobi smoothing provides good results for the equidistant
4x4-topology, but with increasing anisotropy the convergence rates deteriorate
quite drastically. The behaviour of the blockwise SOR smoothing is somewhat
better, but it degrades as well. Only ScaRC seems to be able to work well for
anisotropic macro structures. Due to the underlying block Jacobi character,
however, its convergence behaviour must depend on the structure of the macro
decomposition and on the number of macros.

For the settings (a,b,c) = (0.05,0.2,0.5) and (a,b,c) = (0.001,0.01,0.1),
table 2 compares the convergence results for different choices of the local
refinement parameters (r1, r2, r3). The considered cases, namely
(0.5, 0.5, 0.5)/(0.25, 0.25, 0.5) (which have to be compared with the
equidistant case (0.5, 1.0, 1.0) from the previous table), involve a moderately
to strongly anisotropic refinement of the leftmost macros of the topology, up
to a finest mesh size of h_min ≈ 4 · 10“®.

Here, the global Jacobi smoother leads to divergence in case of strongly
anisotropic refinement, and the blockwise SOR smoothing produces relatively
bad convergence rates as well. In contrast, ScaRC provides the same rates
for all kinds of local anisotropies, i.e., it does not deteriorate with
increasing local anisotropy (in contrast to the global macro structure).
Table 2: Anisotropic refinement for (a,b,c) = (0.05,0.2,0.5) and (a,b,c) =
(0.001,0.01,0.1)


2.3  High  Performance  Linear  Algebra 


One of the main ideas behind the described (Recursive) Divide and Conquer
approach in combination with the ScaRC solver technology is to detect "locally
structured parts". In these "local subdomains" we consistently apply "highly
structured tools" as typical for Finite Difference approaches: line- or rowwise
numbering of unknowns and storing of matrices as sparse bands (however, the
matrix entries are calculated via the Finite Element modules). As a result we
have "optimal" data structures on each of these patches (which often correspond
to the previously introduced "matrix blocks"), and we can apply very powerful
linear algebra tools which explicitly exploit the high performance of specific
machine-optimized libraries (e.g. ESSL, PERFLIB).

We have performed several tests for different tasks and techniques in numerical
linear algebra on some selected hardware platforms. In all cases we attempted
to use "optimal" compiler options and machine-optimized linear algebra libraries
like the ESSL or PERFLIB. Only in the case of the Pentium II did we have to
perform the Gaussian Elimination exclusively with the Fortran sources, which
might explain the worse rates.


Gaussian Elimination: While Gaussian Elimination (GE) is presented only
to demonstrate the (potentially) available performance of the given processors
(often several hundreds of MFLOP/s, which are really measured), we are much
more interested in the realistic runtime behaviour of several matrix-vector
multiplication (MV) techniques. The measured MFLOP rates for the Gaussian
Elimination are for a dense matrix (analogously to the standard Linpack test).
Matrix-vector multiplication: We examine more carefully the following variants,
all of which are typical in the context of iterative schemes with sparse matrices.
The test matrix is a typical 9-point stencil ("discretized Poisson operator"). We
perform tests for two different vector lengths N and give the measured MFLOP
rates, which are all calculated via 20 x N/time (for MV), resp., 2 x N/time (for
DAXPY (linear combination)).

Sparse MV: SMV

The sparse MV technique is the standard technique in Finite Element codes
(and others), also well known as the "compact storage" technique or similar: the
matrix plus index arrays or lists are stored as long arrays containing the nonzero
elements only. While this approach can be applied for arbitrary meshes and
numberings of the unknowns, no explicit advantage of the linewise numbering
can be exploited. We expect a massive loss of performance with respect to the
possible peak rates since, at least for larger problems, no "caching in" and
"pipelining" can be exploited, such that the higher cost of memory access will
dominate the resulting MFLOP rates.
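
A minimal Python sketch of such a compact-storage product is given below. It assumes the common compressed-row layout (arrays `values`, `col_idx`, `row_ptr`); the text does not specify the exact storage scheme used in the benchmarked codes, so the layout and the function name are illustrative only.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A*x for a matrix stored in compressed-row form (assumed layout).

    values  : nonzero entries, row after row
    col_idx : column index of each nonzero entry
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect addressing x[col_idx[k]]: the access pattern that
            # prevents the linewise numbering from being exploited.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Every access to `x` goes through `col_idx`, so the memory access pattern is only known at run time.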

Banded  MV:  BMV 

A "natural" way to improve the sparse MV is to exploit that the matrix is a
banded matrix with 9 bands only. Hence the matrix-vector multiplication is
rewritten such that now "band after band" is applied. The obvious advantage
of this banded MV approach is that these tasks can be performed on the basis of
BLAS 1-like routines which may exploit the vectorization facilities of many
processors (particularly on vector computers). However, for "long" vector lengths
the improvements can be absolutely disappointing: for the recent workstation/PC
chip technology the processor cache dominates the resulting efficiency!
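
The "band after band" idea can be sketched in Python as follows. The storage layout (a dictionary mapping each diagonal offset to a list `band` with `band[i] = A[i][i + offset]`) and the function names are assumptions for illustration; the offsets returned by `stencil_offsets` are those of a 9-point stencil on an m x m grid with linewise numbering.

```python
def stencil_offsets(m):
    """Diagonal offsets of a 9-point stencil on an m x m grid, linewise numbering."""
    return [-m - 1, -m, -m + 1, -1, 0, 1, m - 1, m, m + 1]


def banded_matvec(bands, x):
    """y = A*x, applying one band after another.

    bands maps a diagonal offset to a list band with band[i] = A[i][i + offset]
    (entries that fall outside the matrix are ignored).
    """
    n = len(x)
    y = [0.0] * n
    for off, band in bands.items():
        lo, hi = max(0, -off), min(n, n - off)
        # Unit-stride loop without index arrays: the part that maps onto
        # BLAS 1-like (vectorizable) routines.
        for i in range(lo, hi):
            y[i] += band[i] * x[i + off]
    return y


# Tridiagonal 3x3 example: [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
bands = {-1: [0.0, 1.0, 1.0], 0: [2.0, 2.0, 2.0], 1: [1.0, 1.0, 0.0]}
y = banded_matvec(bands, [1.0, 2.0, 3.0])
```

Each band sweep reads all of `x` and writes all of `y`, so for long vectors the data is streamed through the cache once per band, which is consistent with the cache effect described above.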

Banded blocked MV: BBMVA, BBMVL, BBMVC

The final step towards highly efficient components is to rearrange the
matrix-vector multiplication in a "blockwise" sense: for a certain set of
unknowns, a corresponding part of the matrix is treated such that cache-optimized
and fully vectorized operations can be performed. This procedure is called
"BLAS 2+"-style, since in fact certain techniques for dense matrices which are
based on routines from the BLAS2, resp., BLAS3 library, have now been developed
for such sparse banded matrices. The exact procedure has to be carefully
developed depending on the underlying FEM discretization, and a more detailed
description can be found in [2].

While BBMVA has to be applied in the case of arbitrary matrix coefficients,
BBMVL and BBMVC are modified versions which can be used under certain
circumstances only (see [2] for technical details). For example, PDEs with
constant coefficients, such as the Poisson operator, on a mesh which is adapted
in one special direction only allow the use of BBMVL: this is often the case for
the Pressure-Poisson problem in flow simulations (see [7]) on boundary adapted
meshes. Additionally, version BBMVC may be applied for PDEs with constant
coefficients on meshes with equidistant mesh distribution in each (local)
direction separately: this is typical for tensor product meshes in the interior
domain where the solution is mostly smooth.
Computational results: The following table illustrates the linear algebra
routines discussed above with their performance rates. For further results see [1].

Table 3: Benchmark results


2.4  Several  adaptivity  concepts 

As  typical  for  modern  FEM  packages,  we  directly  incorporate  certain  tools  for 
grid  generation  which  allow  an  easy  handling  of  local  and  global  refinement  or 
coarsening  strategies:  adaptive  mesh  moving,  macro  adaptivity  and  fully 
local  adaptivity. 

Adaptive strategies for moving mesh points along boundaries or inner structures
allow the same logic structure in each "macro block", and hence the shown
performance rates can be preserved. Additionally, we work with adaptivity
concepts related to each "macro block". Allowing "blind" or "slave macro nodes"
preserves the high-performance facilities in each "matrix block", and is a good
compromise between fully local adaptivity and optimal efficiency through
structured data. Only if these concepts do not lead to satisfactory results will
certain macros lose their "highly structured" features through the (local) use
of fully adaptive techniques. On these (hopefully) few patches, the standard
"sparse" techniques for unstructured meshes have to be applied.


2.5  Direct  integration  of  parallelism 

Most software packages are designed for sequential algorithms to solve a given
PDE problem, and the subsequent parallelization of certain methods often takes
disproportionately long. In fact, it is easy to say but hard to realize with
most software packages. However, the more important step, which makes
parallelization much easier, is the design of the ScaRC solver according to
the hierarchical decomposition in different stages. Indeed, from an algorithmic
point of view, our sequential and parallel versions differ only in the way that
Jacobi- and Gauß-Seidel-like schemes differ. Hence all parallel executions can
be identically simulated on single processors, which however can additionally
improve their numerical behaviour with respect to efficiency and robustness
through Gauß-Seidel-like mechanisms.

Hence we only provide in FEAST the "software" tools for including parallelism
on a low level, while the "numerical parallelism" is incorporated via our
ScaRC solver and the hierarchical "tree structure". However, what will be
"non-standard" is our concept of (adaptive) parallel load balancing, which is
oriented towards "total numerical efficiency" (that means, "how much processor
work is spent to achieve a certain accuracy, depending on the local
configuration") in contrast to the "classical" criterion of equilibrating the
number of local unknowns (see [2] for detailed information and examples in
FEAST).

3  Pre-  and  Postprocessing 

3.1  General  remarks 

As remarked in the introduction, the main pre- and postprocessing tasks should
be realized by a general framework of Java-based programs called DeViSoR.
DeViSoR means "Design & Visualization Software Resource". This framework
is intended to perform the main tasks of grid generation and editing, control of
the calculation and visualization of the results. These main tasks use the same
ground classes (called DFC - DeViSoR Foundation Classes) and the same user
interface, so the access to the underlying numerical core parts is performed in
the same manner.

As indicated in the introduction, the various subtasks can be performed on
several machines which communicate over a network system. This allows the
user to choose the suitable system for the corresponding task, e.g. a Silicon
Graphics workstation for the visualization. The access to a parallel computing
system should also be performed by a Java program. This allows users other
than the developer of a numerical code to use a parallel computer.

DeViSoR is planned to be an "open system" for the development of pre- and
postprocessing tools for FEM packages. The DeViSoR foundation classes contain
the basic tools to handle and administrate typical FEM structures. Further
applications could realize e.g. further visualization procedures and adaptations
to several parallel computer systems.

For this project Java has been chosen as implementation environment. Though
Java is a relatively "young" programming language, the advantages of this system
are significant. The "write once, run anywhere" capability reduces the
implementation effort considerably compared with combinations like
C/C++/OpenGL. There is only one program, which runs without any modification on
several different configurations like Unix workstations, Linux PCs, Windows PCs,
Macintoshes and many more. A further advantage is the core class library for
various subareas like file handling, network functions, visualization and user
interface facilities. These classes are easy to use and produce a pleasing
output. The use of additional tools like application builders is not necessary.
The main disadvantage of current Java implementations is the relatively low
performance, caused by the fact that Java is an interpreted language. However,
further developments like more sophisticated interpreters with Just-In-Time
compiling facilities and especially the native Java processor will hopefully
close this performance gap.


3.2  Preprocessing:  DeViSoRGrid 

This subprogram should support the generation and editing of 2D domains. The
two main parts are the description of the domain boundary and the generation
of the grid structure. The program supports several boundary elements like lines,
arcs and splines; further, it is planned to add a segment which consists of a
Fortran subroutine which describes a parametrization. This allows the use of an
analytic description. Several triangular and quadrilateral elements are
supported. Extensive editing possibilities allow the user to delete, move and
adjust the boundaries and elements. For the future it is planned to implement
simple automatic grid generators for producing coarse grids. As a further task,
this program should be able to read many formats from other tools like CAD
systems and professional grid generators and prepare this data for use in the
calculation process.


3.3  Processing:  DeViSoRControl 

DeViSoRControl enables the user to control the calculation and to follow the
calculation progress. The main tasks of this program part are the distribution
of the macros to the processing nodes (at the moment manually, in future
automatically), the collecting and displaying of the log information of the
processing nodes and finally the configuration of the ScaRC algorithm with
respect to the selection of smoothing/preconditioning methods on a given
hierarchical layer, the number of smoothing steps and the exit criterion.
Furthermore, this part builds the interface to the other DeViSoR parts for the
pre- and postprocessing. From the control part the grid program is invoked and
a grid is edited. For this grid the user selects the desired solution method and
finally visualizes the results with the DeViSoRVision program.


3.4  Postprocessing:  DeViSoRVision 

The last part in the current project is the DeViSoRVision program, which
performs the visualization task. The program offers several techniques to
visualize the results of the calculation, such as shading techniques, isolines
and particle tracing (planned). Furthermore, it contains an animation module to
create animations for nonstationary problems. The result of the animation can be
stored in several formats like MPEG and animated GIF.



FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


4  Conclusions  and  Outlook 

We expect the first version of FEAST for the end of 1998, but most of the
"numerical" and "computational" ingredients have already been successfully
realized in several test implementations (see the references). The current
status of the FEAST project and further information can always be obtained from
our web page:


http://gaia.iwr.uni-heidelberg.de/~featflow


Nevertheless, help is always welcome: for instance in implementing and testing
many auxiliary components, pre- and postprocessing, "unit square" experts
and "computers for performance measurements", etc.
Abstract In creating interconnection networks, an efficient design is crucial because of its
impact  on  the  parallel  computer  performance.  A  routing  scheme  that  minimises  contention  and 
avoids  the  formation  of  hot-spots  should  be  included  in  the  design.  Static  schemes  are  not  able 
to  adapt  to  traffic  conditions.  We  have  developed  a  new  method  to  uniformly  distribute  traffic 
over  the  network  called  Distributed  Routing  Balancing  (DRB)  that  is  based  on  limited  and 
load-controlled  path  expansion  in  order  to  maintain  a  low  message  latency.  The  method 
uniformly  balances  the  communication  load  between  all  links  of  the  interconnection  network 
and  maintains  a  controlled  latency,  provided  that  total  bandwidth  requirements  do  not  exceed 
the  total  link  bandwidth  available  in  the  interconnection  network.  DRB  defines  how  to  create 
alternative  paths  to  expand  single  paths  (expanded  path  definition)  and  when  to  use  them 
depending  on  traffic  load  (expanded  path  selection  policies).  We  explain  the  DRB  principles 
and  show  the  performance  evaluation  of  the  method  carried  out  by  simulation. 


1.  Introduction 

In  the  evolution  of  multi-computers,  communication  performance  becomes  more 
and  more  important.  One  of  the  most  crucial  problems  that  affects  performance  in 
communications  is  message  contention.  A  sustained  contention  can  produce  hotspots 
[Pfi85].  A  hotspot  is  a  saturated  region  of  the  network,  i.e.  there  exists  more 
bandwidth  demand  than  the  network  can  offer  and  then,  messages  that  enter  this 
region  suffer  a  very  high  latency,  while  other  regions  of  the  network  can  be  less 
loaded,  or  even  far  away  from  saturation.  The  problem  here  is  that  there  exists  a  poor 
communication  load  distribution  and  that,  although  the  total  communication 
bandwidth  requirements  do  not  surpass  the  total  offered  bandwidth  of  the 
interconnection  network,  this  uneven  distribution  causes  saturated  points  as  if  the 
whole interconnection network were collapsed. In addition, the hotspot propagates
rapidly to contiguous areas in a domino effect, which is even worse in the case of
wormhole routing because a blocked packet occupies a large number of links spread
in the network.

* This work has been supported by the Spanish Comision Interministerial de
Ciencia y Tecnologia (CICYT) under contract number TIC 95/0868.

Latency  must  be  avoided  in  order  to  make  communications  faster,  but  some 
amount  of  latency  can  be  tolerated  and  it  is  much  more  important  to  avoid  big  latency 
variations.  This  is  because  latency  can  be  hidden  by  the  mapping  task  assigning  an 
excess  of  parallelism,  i.e.  having  enough  processes  per  processor  and  scheduling  any 
ready  process  while  other  processes  wait  for  their  messages.  However,  in  order  to  be 
able  to  assign  processes  to  processors  correctly,  the  mapping  task  must  know  the 
process  computation  and  communication  volumes  and,  to  some  extent,  the  latency 
that  messages  will  suffer.  But  if  latency  undergoes  big  unpredictable  variations  from 
the  expected  values,  due  to  hotspots,  for  example,  the  mapping  will  fail  and  idle 
processors  will  appear,  increasing  the  total  execution  time  of  the  application.  This  is 
the  reason  for  the  importance  of  a  low  and  uniform  contention  latency.  In  addition,  in 
Distributed  Shared  Memory  Multicomputers  latency  uniformity  is  a  key  issue  for 
scheduling  and  mapping  in  such  systems. 

In order to avoid hotspot generation, static or oblivious routing cannot provide
any help because, under this routing, a message route is completely determined by
the source-destination pair, independent of traffic conditions. Therefore, other
mechanisms have been developed to avoid hotspot generation in interconnection
networks, such as adaptive routing algorithms that try to adapt to traffic
conditions, for example Planar Adaptive Routing [CK92], the Turn Model [NG92],
Duato's Algorithm [Dua93], Compressionless Routing [KLC94], Chaos Routing
[KSn91], Random Routing [Val81] [May93] and other methods presented in [DYN97].
The main disadvantage of adaptive routing is the high overhead caused by
information monitoring, path changing and the necessity to guarantee deadlock,
livelock and starvation freedom. These drawbacks have limited the implementation
of these techniques in commercial machines.

The  work  presented  in  this  paper  focuses  on  developing  new  methods  to  distribute 
paths  in  the  interconnection  network  using  network-load  controlled  path  expansion. 
The  method  is  called  Distributed  Routing  Balancing  (DRB)  and  its  objective  is  to 
uniformly  balance  traffic  load  over  all  paths  in  the  whole  interconnection  network  by 
creating  alternative  paths  between  each  source  and  the  destination  nodes  in  order  to 
maintain  a  low  message  latency.  DRB  defines  how  to  create  alternative  paths  to 
expand  single  paths  (multi-lane  path  definition)  and  when  to  use  them  depending  on 
traffic  load  (multi-lane  path  selection  policy). 

The  next  section  explains  the  Distributed  Routing  Balancing  technique.  DRB  has 
two  components:  first,  a  systematic  methodology  to  generate  the  multi-lane  paths  and 
second,  policies  to  monitor  traffic  load  and  select  multi-lane  paths  to  get  the  message 
distribution  according  to  traffic  load.  Both  are  explained  in  Section  3  and  Section  4, 
respectively. Section 5 presents the evaluation of the first DRB component and
Section  6  the  validation  of  the  selection  policies.  Section  7  presents  the  conclusions. 
2. Distributed Routing Balancing

Distributed  Routing  Balancing  is  a  method  to  create  alternative  source-destination 
paths  in  the  interconnection  network  using  a  load-controlled  path  expansion.  DRB 
distributes  every  source-destination  message  load  over  a  multi-lane  path  made  of 
several  paths.  The  objective  of  DRB  is  a  uniform  distribution  of  the  traffic  load  over 
the  whole  interconnection  network  in  order  to  maintain  a  low  message  latency  and 
avoid  the  generation  of  hotspots.  When  a  single  source-destination  path  is  becoming 
saturated,  the  method  looks  for  low  loaded  paths  to  form  a  multi-lane  path.  This 
distribution  will  maintain  a  uniform  and  low  latency  on  the  whole  interconnection 
network  provided  that  total  communication  bandwidth  demand  does  not  exceed 
interconnection  network  capacity. 

The DRB method fulfils the following objectives:

1  Reduction  of  the  message  latency  under  a  certain  threshold  value  by  varying  the 
number  of  alternative  paths  used  by  the  source-destination  pair,  while  maintaining  a 
uniform  latency  for  all  messages. 

2  Minimisation  of  path-lengthening.  This  is  important  for  Store&Forward  networks 
because  Transmission  Delay  depends  directly  on  the  message  path  lengths.  For 
Wormhole  and  Cut-Through  flow  controls,  it  is  important  because  the  more  nodes 
used  by  the  message,  the  more  collisions  with  other  messages,  causing  latency 
increments  and  more  bandwidth  use. 

3  Maximisation  of  the  use  of  the  source  and  destination  node  links  (node  grade), 
distributing  messages  fairly  over  all  processor  links. 

In  order  to  show  how  DRB  works  to  create  and  use  alternative  paths,  we  make  the 
following  definitions: 

Definition 0:

An interconnection network I is defined as a directed graph I=(N,E), where N is a
set of nodes $N = \bigcup_{i=0}^{MaxN} N_i$ and E a set of links. Every node is
composed of a router and is connected to other nodes by means of links. In regular
networks, a regular topology is defined, with a dimension and size. For example,
for k-ary n-cubes [Dal90], n is the dimension and k the size. For irregular
networks, an irregular topology is defined.

•If two nodes $N_i$ and $N_j$ are directly connected by a link, then $N_i$ and
$N_j$ are adjacent nodes.

•$Distance(N_i, N_j)$ is the minimum number of links that must be traversed to go
from $N_i$ to $N_j$ according to the graph I.

•A path $P(N_i, N_j)$ between two nodes $N_i$ and $N_j$ is the set of nodes
selected between $N_i$ and $N_j$ according to the minimal static routing defined
for the interconnection network. $N_i$ is the source node and $N_j$ the
destination node. The length of a path P, $Length(P)$, is the number of links
traversed between $N_i$ and $N_j$.

Definition  1: 

A Supernode $S(type, size, N_0^S, V(S)) = \bigcup_{i=0}^{l} N_i^S$ is defined as a
structured region of the interconnection network consisting of adjacent nodes
$N_i^S$ around a "central" node $N_0^S$, provided that they comply with a given
property specified in type and that $Distance(N_i^S, N_0^S) \le size$. Each node
has an associated weight stored in the array V(S).

$V(S) = \{ w_i^S \in \mathbb{R} : 0 \le w_i^S \le 1,\ i = 0..l \}$ is a linear
array of weights associated with each $N_i^S$, where $w_i^S$ is the weight of
$N_i^S$ and $\sum_{i=0}^{l} w_i^S = 1$. These weights are used by the DRB
selection policy (Sect. 4). As particular cases, any single node and the whole
interconnection network are Supernodes. A node can belong to more than one
Supernode. Section 3 presents different Supernode types.
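
To make Definition 1 concrete, the following Python sketch builds a Supernode on a 2D mesh. Several things here are assumptions for illustration only: the mesh topology with Manhattan distance, the hypothetical Supernode type "all nodes within distance size of the central node" (the actual types are those of Section 3), and the uniform initial weights.

```python
def mesh_distance(a, b):
    """Minimum number of links between two nodes of a 2D mesh (assumed topology)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def supernode(center, size, width, height):
    """Nodes within `size` links of `center`, central node first, plus weights.

    The membership property (hypothetical type) is simply the distance bound;
    the weights are initialised uniformly and sum to 1.
    """
    nodes = [center]
    for x in range(width):
        for y in range(height):
            node = (x, y)
            if node != center and mesh_distance(node, center) <= size:
                nodes.append(node)
    weights = [1.0 / len(nodes)] * len(nodes)
    return nodes, weights


corner_nodes, corner_weights = supernode((0, 0), 1, 4, 4)
single_nodes, single_weights = supernode((2, 2), 0, 4, 4)
```

With size 0 the Supernode degenerates to the single central node with weight 1, which corresponds to the particular case mentioned in the definition.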


Definition  2: 

A Multi-step Path $P_s(SOrigin, N_i^{SOrigin}, N_j^{SDest}, SDest)$ is the path
generated between two Supernodes SOrigin and SDest as

$P_s = P(N_0^{SOrigin}, N_i^{SOrigin}) . P(N_i^{SOrigin}, N_j^{SDest}) . P(N_j^{SDest}, N_0^{SDest})$,

where . means path concatenation, composed of the following steps:

Step 1: From the central node of the Supernode Supernode_Origin, $N_0^{SOrigin}$,
to a node $N_i^{SOrigin}$ belonging to Supernode_Origin,

Step 2: From the $N_i^{SOrigin}$ to a node $N_j^{SDest}$ belonging to
Supernode_Destination,

Step 3: From the $N_j^{SDest}$ to the central node of the Supernode
Supernode_Destination, $N_0^{SDest}$.

In the most general case, there are three steps. However, if one of the Supernodes
Supernode_Origin = {$N_0^{SOrigin}$} or Supernode_Destination = {$N_0^{SDest}$},
the number of steps is two; and there is one step, the one which follows static
routing, if Supernode_Origin = {$N_0^{SOrigin}$} and Supernode_Destination =
{$N_0^{SDest}$}.

Length of a multi-step path $Length(P_s)$ is defined as the sum of each individual
step length following static routing. From this definition, it can be seen that
Multi-step Paths between $N_0^{SOrigin}$ and $N_0^{SDest}$ can be of non minimal
length.


Definition 3:

A Metapath P*(Supernode_Origin, Supernode_Destination) is the set of all
multi-step paths $P_s$ generated between the Supernodes Supernode_Origin and
Supernode_Destination:

$P^* = \bigcup_{\forall i,j} P_s(SOrigin, N_i^{SOrigin}, N_j^{SDest}, SDest)$,

where l is the number of nodes of Supernode_Origin and k the number of nodes of
Supernode_Destination. The number of Multi-step Paths which compose the
Metapath is s=l*k. Metapath Length (ML) is the average of all the individual
multi-step path lengths that compose it,
$Length(P^*) = (1/s) \sum_{\forall s} Length(P_s)$,
and Metapath Relative Bandwidth (MRB) is the number of multi-step paths s.
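
Definitions 2 and 3 can be sketched in Python as follows. The sketch assumes a 2D mesh in which the minimal static route between two nodes has a length equal to their Manhattan distance; this topology and the function names are illustrative assumptions, not part of the DRB definition. Each Supernode is given as a list of nodes with the central node first.

```python
from itertools import product


def dist(a, b):
    """Length of the minimal static route on a 2D mesh (assumed topology)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def multistep_length(src_center, n_i, n_j, dst_center):
    """Length(P_s): sum of the three step lengths of a multi-step path."""
    return dist(src_center, n_i) + dist(n_i, n_j) + dist(n_j, dst_center)


def metapath(src_nodes, dst_nodes):
    """Enumerate all s = l*k multi-step paths and return (paths, ML, MRB).

    src_nodes[0] and dst_nodes[0] are the central nodes of the Supernodes.
    """
    paths = list(product(src_nodes, dst_nodes))
    s = len(paths)                                   # MRB
    total = sum(multistep_length(src_nodes[0], n_i, n_j, dst_nodes[0])
                for n_i, n_j in paths)
    return paths, total / s, s                       # ML is the average length


_, ml_static, mrb_static = metapath([(0, 0)], [(3, 0)])
_, ml_wide, mrb_wide = metapath([(0, 0), (0, 1)], [(3, 0)])
```

In the second call one of the two multi-step paths detours through (0, 1) and has length 5 instead of 3, which illustrates that multi-step paths can be of non minimal length and shows how ML grows as MRB is increased.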

Now,  we  explain  how  communication  is  managed  under  DRB  to  get  path 
distribution.  Suppose  there  is  a  parallel  program  described  as  a  collection  of  processes 
and  channels  and  a  process  mapping  which  assigns  each  process  to  a  processor. 
Processes  are  executed  concurrently  and  communicate  by  channels. 

The  routing  run-time  support  configures  a  Metapath  P*  for  each  channel  by 
assigning  a  Source  Supernode  to  the  source  node  and  a  Destination  Supernode  to  the 
destination  node.  This  function  is  carried  out  by  a  Metapath  selection  policy  called 
Metapath  Issuing  through  Latency  Evaluation  (MILE)  which  is  described  in  Section 
4.  The  Source  Supernode  is  a  Message  Scattering  Area  (MeSA)  from  the  Source 
node.  The  Destination  Supernode  is  a  Message  Gathering  Area  (MeGA)  to  the 
Destination  node. 

Then, for each message that the source process wants to send, the channel manager
in the routing run-time support selects two nodes, one $N_i^{SOrigin}$ from the
Source Supernode and the other $N_j^{SDest}$ from the Destination Supernode, to
form a Multi-step Path $P_s(SOrigin, N_i^{SOrigin}, N_j^{SDest}, SDest)$
belonging to the Metapath P*. These node selections are made based on the weight
arrays, which are used as probability distributions for each node. Then, the
message travels along the selected Multi-step Path.
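
The per-message selection described above can be sketched with the Python standard library. The function name and the way the weight arrays are passed are assumptions for illustration; how the weights are updated from the observed latency is the task of the selection policy of Section 4 and is not modelled here.

```python
import random


def select_multistep_path(src_nodes, src_weights, dst_nodes, dst_weights,
                          rng=random):
    """Pick one intermediate node per Supernode, using the weight arrays
    V(S) as discrete probability distributions."""
    n_i = rng.choices(src_nodes, weights=src_weights, k=1)[0]
    n_j = rng.choices(dst_nodes, weights=dst_weights, k=1)[0]
    return n_i, n_j


rng = random.Random(0)                       # fixed seed for reproducibility
src_nodes, src_weights = [(0, 0), (0, 1), (1, 0)], [0.5, 0.25, 0.25]
dst_nodes, dst_weights = [(3, 3)], [1.0]
picks = [select_multistep_path(src_nodes, src_weights,
                               dst_nodes, dst_weights, rng)
         for _ in range(1000)]
```

Over many messages the fraction of traffic routed through each intermediate node approaches its weight, so changing the weights shifts load between the lanes of the Metapath without any per-message routing decision inside the network.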

Under  this  scheme,  the  communication  between  source  and  destination  can  be  seen 
as  if  it  were  using  a  wider  multi-lane  “Metapath”  of  potentially  higher  bandwidth  than 
the  original  path  from  a  source  “Supernode”  and  to  a  destination  “Supernode”.  This 
multi-lane path can be likened to a highway, and the MeSA and MeGA to the highway
access and exit areas, respectively.

It  is  important  to  remark  that,  in  order  to  achieve  an  effective  uniform  load 
distribution,  a  global  action  is  needed.  For  this  reason,  all  source-destination  nodes 
are  able  to  expand  their  paths  depending  on  the  message  traffic  load  between  them 
during  program  execution. 

The  mentioned  DRB  Metapath  Selection  Policy  defines  the  Metapath  type  and  size 
to  determine  a  Metapath  of  specific  length  and  bandwidth  depending  on  traffic 
conditions  (Section  4).  The  policy  needs  to  know  the  length  and  bandwidth  of  a 
Metapath  given  the  type  and  size.  Therefore,  a  Metapath  Characterisation  is  needed  to 
determine  the  Length  and  Relative  Bandwidth  of  a  Metapath  given  its  type  and  size. 
This  characterisation  has  been  carried  out  by  experimentation  and  is  explained  in 
Section  5. 

Comparison  with  existing  methods 

Many adaptive methods try to modify the current path when a message arrives at a
congested node. This is the case, for example, of Chaos routing [KSn91], which uses
randomisation to misroute messages when the message is blocked. DRB does not act
at the individual message level, but tries to adapt the communication flow between
source and destination nodes to non-congested paths.

Random routing algorithms [Val81] [May93] uniformly distribute bandwidth
requirements over the whole machine, independently of the traffic pattern generated by
the application, but at the expense of doubling the path length. A closer view shows
that paths of maximum length are not lengthened but paths of length one are
lengthened, on average, up to the average distance for regular networks. So, the
shortest paths are severely affected. This is due to the method being “blind”: it
does not take into account current traffic and it distributes all messages by “brute
force” over the entire machine. Although DRB shares some objectives with random
routing, the difference is that DRB not only tries to maintain throughput but also
keeps individual message latency limited, because path lengthening can be
controlled. Random routing, however, doubles the path length on average, with the
negative effect on latency mentioned above. It can be seen that static routing
is an extreme case of DRB in which both Supernodes, source and destination, contain
only the source or destination node, respectively; and that random routing is the other
extreme in which the source Supernode contains all nodes of the interconnection
network.
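
The effect of random routing on path length can be checked by brute force on a small network. The sketch below assumes a 4-dimensional hypercube, minimal routing in each of the two phases, and an intermediate node chosen uniformly among all nodes; these choices and the function names are ours, not taken from [Val81] or [May93].

```python
from itertools import product

def hamming(a, b):
    """Minimal hop count between two hypercube nodes labelled by integers."""
    return bin(a ^ b).count("1")

def static_average_distance(n):
    """Average minimal distance over all ordered node pairs of an n-cube."""
    nodes = range(2 ** n)
    total = sum(hamming(s, d) for s, d in product(nodes, repeat=2))
    return total / (2 ** n) ** 2

def random_routing_expected_length(n, s, d):
    """Expected length of s -> i -> d when i is drawn uniformly from all nodes."""
    nodes = range(2 ** n)
    return sum(hamming(s, i) + hamming(i, d) for i in nodes) / 2 ** n

n = 4
avg_static = static_average_distance(n)
neighbours = random_routing_expected_length(n, 0b0000, 0b0001)  # static distance 1
antipodal = random_routing_expected_length(n, 0b0000, 0b1111)   # static distance 4
```

In this small case the expected two-phase length is the same for every pair and equals twice the static average distance, so adjacent nodes pay the largest relative penalty while pairs at maximum distance are not lengthened.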

A similar but restricted, less flexible and non-adaptive solution is offered by the
IBM SP2 routing algorithm, RTG, which statically selects four paths for each
source-destination pair; these are used in a “round-robin” fashion to utilise
the network more uniformly [Sni95]. The Meiko CS-2 machine also pre-establishes all
source-destination paths and selects four alternative paths to balance the network traffic
[Bok96].

Like other adaptive methods, DRB can introduce the possibility of deadlock, in
which case one of the existing techniques for deadlock avoidance should be used
(such as structured buffer-pool [Mer80], virtual channels [Dal87] or virtual networks
[Yan89]) depending on network characteristics (topology, flow control, etc.).

It can be seen that, by definition, DRB is livelock free, because it never produces
infinite path lengths, and also starvation free, because no node is prevented from
injecting its messages. Also, message ordering must be preserved and it is the
system’s responsibility to deliver messages belonging to the same logical channel in order.
Message prefetching can be used to hide message disordering.


3.  Supernode  Types 

We  have  defined  two  Supernode  types  suitable  for  any  topology  and  which  define 
a  broad  range  of  Supernodes  that  allow  a  choice  according  to  the  desired  trade-offs 
between  Metapath  length  (ML)  and  Relative  Bandwidth  (MRB).  The  first  one  is 
called  Gravity  Area  and  the  second  Subtopology. 

The parameters type and size of the Supernode determine which nodes are included
in the Supernode. A given Supernode has the following characteristics: topological
shape, number of nodes, Grade (number of $N_i$ node links not connected to other $N_i$
node links, i.e. links connected outside the Supernode) and Grade usage of the central
node in the first step.

Gravity  Area  Supernode 

The first Supernode type, S("Gravity Area", size, $N_0$, V(S)), is called Gravity Area
and defines, for an n-grade network, an n-ary tree of depth size with the root at the
central node $N_0$. This type maps a tree of maximum grade over the topology. This tree
expands as far as possible and includes all nodes which are at distance size or less from
the root. It is suitable for regular or irregular networks. A Gravity Area Supernode is
the set of nodes at a distance less than or equal to size from the node $N_0$.
Metapaths configured using Gravity Area Supernodes fulfil the above-mentioned
objectives of maximising the number of paths while minimising path lengthening and
maximising node link usage, since they make use of all node links.
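
As an illustration, the membership of a Gravity Area Supernode can be computed with a breadth-first search limited to size hops. The sketch below is a minimal example assuming a k x k wraparound torus with nodes named by (x, y) coordinates; the helper names are hypothetical.

```python
from collections import deque

def torus_neighbours(node, k):
    """The four neighbours of a node in a k x k wraparound torus."""
    x, y = node
    return [((x + 1) % k, y), ((x - 1) % k, y), (x, (y + 1) % k), (x, (y - 1) % k)]

def gravity_area(centre, size, neighbours):
    """Return {node: distance} for all nodes within `size` hops of `centre`."""
    distance = {centre: 0}
    queue = deque([centre])
    while queue:
        node = queue.popleft()
        if distance[node] == size:
            continue                      # do not expand beyond the Supernode size
        for nxt in neighbours(node):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

area = gravity_area((0, 0), 1, lambda node: torus_neighbours(node, 8))
```

Because the search only needs a neighbour function, the same code applies to irregular networks, which matches the remark that this Supernode type suits regular and irregular topologies.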

Subtopology  Supernode: 

The second Supernode type is called Subtopology and defines the topological
shape of the Supernode. It can be applied to regular networks with a structured
topology, dimension and size. A Supernode S("Subtopology", size, $N_0$, V(S)) has the
same full/partial topological shape as the interconnection network but its dimension
and/or size is reduced. Therefore, the Subtopology Supernode should be considered as
a kind of topological “projection” of the network topology. For example, in a k-ary
n-cube a Subtopology Supernode is any j-ary m-cube with j<k and/or m<n. For
Midimew networks, which are a special case of wraparound torus (k-ary 2-cubes),
Subtopology Supernodes are k-ary 1-cubes, i.e. linear structures which follow specific
wraparound links.


4.  Metapath  Selection  Policy 


This section describes the Metapath selection policy used to select a Metapath for
each source-destination pair according to the current message load. We have designed and
present here a dynamic policy for DRB called Metapath Issuing through Latency
Evaluation (MILE), which aims to minimise overhead and to be scalable. To this end,
there is no periodic information exchange and MILE is fully distributed. Under low
traffic load, the monitoring activity is minimal and the paths follow minimal static
routing. The objective of the monitoring activity is to identify the current traffic
pattern.

The policy objectives are to select the Supernode size and type and to distribute the
load among the Multi-step Paths of the Metapath. The policy consists of three phases:
Traffic Load Monitoring, Dynamic Supernode Configuration and Multi-Step Path
Selection. Traffic load monitoring is carried out by the messages themselves: the
latency suffered is recorded and carried by the message. The message records
information about the contention it suffers at each node it traverses when it is blocked
by other messages. When messages arrive at their destination carrying latency
information, the destination node decides on the Supernode configuration
depending on the contention that the messages encountered on the Metapath. This
Metapath Configuration is sent to the Source node by means of an acknowledgement
message, and the Source node then distributes messages following the DRB specification.

The policy algorithm pseudo-code for each source-destination pair is presented as
follows:

Traffic Load Monitoring()  /* Actions performed by the messages */
Begin
   For each hop
      Accumulate latency
   EndFor
   Deliver latency to the Destination algorithm.
End


Dynamic Supernode Configuration (threshold latency ThL)
/* Algorithm executed at destination nodes */
/* Threshold latency is the change latency from the flat region
   to the rise region defined in Sec. 2 */
Begin
   For each Multi-step Path of the Metapath
      Receive Multi-step Latency recorded by the message itself.
   EndFor
   Order Multi-step Paths according to the Latency they suffered
   Classify Multi-step Paths as saturated or non-saturated
      depending on whether their latencies exceed the ThL or not.
   Calculate Total Latency (TL) adding each Multi-step Latency.
   Calculate Total Threshold Latency (TThL) multiplying
      ThL * Metapath Relative Bandwidth (MRB)

   /* Load Balancing: Distribute traffic load among the Multi-step Paths */
   If (TL does not exceed TThL and there exist saturated Multi-step Paths)
      Redistribute Supernode node weights w_i to move load from
         saturated Multi-step Paths to non-saturated Multi-step Paths
   ElseIf (TL does not exceed TThL and there do not exist
           saturated Multi-step Paths)
      Reduce Multi-step Paths (reducing Supernode configurations)
      Redistribute Supernode node weights w_i to move load from
         disappearing Multi-step Paths to the other Multi-step Paths
   ElseIf (TL exceeds TThL)
      Add new Multi-step Paths (expanding Supernode
         configurations) as non-saturated Multi-step Paths
      Redistribute Supernode node weights w_i to move load from
         saturated Multi-step Paths to non-saturated Multi-step Paths
   EndIf
   Send new Supernode Configurations and weights to the Source
      node by means of an acknowledgement message.
End


Multi-Step Path Selection()  /* Actions executed by sender nodes */
Begin
   Receive new Supernode configurations
   Distribute subsequent messages among the Supernode nodes N_i
      according to their corresponding weights w_i.
End
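
The decision taken at the destination node can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the function name, the dictionary of per-path latencies and the action labels are hypothetical, and the Total Threshold Latency is assumed to be the threshold latency ThL scaled by the Metapath Relative Bandwidth.

```python
def configure_supernode(path_latencies, thl, mrb):
    """Return (action, saturated, non_saturated) for one source-destination pair.

    path_latencies: {path_id: latency reported by the messages on that path}
    thl: threshold latency ThL; mrb: Metapath Relative Bandwidth (MRB).
    """
    # Order the Multi-step Paths by the latency they suffered (worst first).
    ordered = sorted(path_latencies, key=path_latencies.get, reverse=True)
    saturated = [p for p in ordered if path_latencies[p] > thl]
    non_saturated = [p for p in ordered if path_latencies[p] <= thl]
    total_latency = sum(path_latencies.values())   # TL
    total_threshold = thl * mrb                    # TThL (assumed ThL * MRB)
    if total_latency > total_threshold:
        action = "expand"      # add Multi-step Paths, then rebalance weights
    elif saturated:
        action = "rebalance"   # move load from saturated to non-saturated paths
    else:
        action = "reduce"      # drop Multi-step Paths, then rebalance weights
    return action, saturated, non_saturated

# Hypothetical latencies (in cycles) for a Metapath with two Multi-step Paths.
decision = configure_supernode({"a": 10, "b": 2}, thl=8, mrb=2)
```

The returned action and path classification would then drive the weight redistribution sent back to the source node in the acknowledgement message.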


This  DRB  MILE  takes  advantage  of  the  spatial  and  temporal  locality  of  parallel 
program  communications,  like  cache  memory  systems  do  with  memory  references. 
The  algorithm  adapts  the  Metapath  configurations  to  the  current  traffic  pattern.  While 
this  pattern  is  constant,  latencies  will  be  low  and  the  MILE  is  not  activated.  If  the 
application  changes  to  a  new  traffic  pattern  and  message  latencies  change,  the  MILE 
will  adapt  Metapaths  to  the  new  situation.  DRB  is  useful  for  persistent 
communication  patterns  which  are  the  ones  which  can  cause  the  worst  hotspot 
situations.  This  Metapath  adaptability  is  specific  and  can  be  different  for  each  source- 
destination  pair  depending  on  their  static  distance  or  latency  conditions. 

The memory space and execution time overheads of the policy are very low because
the actions involved are very simple. In addition, these activities are executed a
number of times which depends linearly on the number of logical channels of the
application and the number of messages. Regarding the time overhead, it can be seen
that monitoring is just the recording of latency by the message itself, i.e. storing a
few integers, and that the decision algorithm is a local and simple computation applied
only each time a latency rise is detected. Regarding the space overhead, the latency
record is one or a few integers that the message carries in its header, and the new
Supernode configuration information is a short message of integers.


5.  Metapath  characterisation 

This  section  explains  the  metapath  characterisation  that  has  been  carried  out  to 
determine  the  Metapath  Length  (ML)  and  Metapath  Relative  Bandwidth  (MRB)  of  a 
Metapath  given  its  type  and  size.  A  series  of  experiments  have  been  carried  out  for  all 
Metapath  types  and  sizes  and  for  k-ary  n-cubes  and  Midimews  from  8  to  64K  nodes. 

For  a  given  topology  and  metapath  type  and  size,  the  experiment  consisted  in 
calculating  the  average  ML  and  average  MRB  by  averaging  ML  and  MRB  for  all 
generated  Metapaths  when  changing  the  source  and  destination  nodes. 

This  average  ML  for  a  specific  Metapath  is  considered  as  the  average  network 
distance  of  the  interconnection  network  under  the  routing  defined  by  the  Metapath. 
This  average  network  distance  has  been  compared  to  the  average  network  distance  for 
static  and  random  routings. 

In addition, the standard deviation of the lengths of the Multi-step Paths which
compose the Metapath was calculated. The standard deviation shows how uniformly
the method distributes path lengthening among the Multi-step Paths.
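
For one source-destination pair, the Metapath Length and the standard deviation of its Multi-step Path lengths can be computed as in the following sketch. It assumes an 8 x 8 wraparound torus, Gravity Area Supernodes of size 1 at both ends, minimal routing on every step and an unweighted average over all Multi-step Paths; the function names are hypothetical and the Relative Bandwidth is not modelled.

```python
from statistics import mean, pstdev

def torus_distance(a, b, k):
    """Minimal hop count between nodes a and b of a k x k wraparound torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, k - dx) + min(dy, k - dy)

def gravity_area(centre, size, k):
    """All nodes at distance `size` or less from `centre`."""
    return [(x, y) for x in range(k) for y in range(k)
            if torus_distance(centre, (x, y), k) <= size]

def multistep_lengths(src, dst, size, k):
    """Lengths of every Multi-step Path src -> i -> j -> dst."""
    return [torus_distance(src, i, k) + torus_distance(i, j, k) + torus_distance(j, dst, k)
            for i in gravity_area(src, size, k)
            for j in gravity_area(dst, size, k)]

k = 8
src, dst = (0, 0), (4, 4)
lengths = multistep_lengths(src, dst, size=1, k=k)
metapath_length = mean(lengths)          # ML of this Metapath
spread = pstdev(lengths)                 # spread of the path lengthening
relative_length = 100 * metapath_length / torus_distance(src, dst, k)
```

Averaging `metapath_length` over all source and destination nodes would give the average ML used in the characterisation.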

As an example of the results obtained, Fig. 1 shows a chart for a 1024-node
(32x32) 2D Torus and a 1024-node 10D HyperCube. The chart shows the Relative
Metapath Average Length, i.e. the percentage relation between the average ML and
the average network distance for static routing, and the Metapath Relative Bandwidth.
The X axis represents different Metapath sizes for the Metapaths $M(S_s, S_d)$, where
$S_s$("Gravity Area", size, $N_0^s$, V(S)) and $S_d$("Gravity Area", size, $N_0^d$, V(S)). Each
X-axis point $x_i$ is the average of the MLs for the Metapaths of size = $x_i$ generated
for all source and destination nodes.


Fig. 1. Metapath characterisation

We can make the following observations from the above charts for Gravity Area
methods. Looking at Relative Metapath Average Length values, it can be seen that
Metapath lengthening growth is proportional to the supernode size for Torus and
HyperCubes. Regarding Metapath Relative Bandwidth, it can be seen that it has a
much higher growth with respect to metapath size than Metapath Length.

In addition, the results show a similar behaviour for equivalent supernodes in
different topologies, which demonstrates method uniformity. Besides the broad range
of alternatives offered by the methods, Gravity Area methods offer a higher extra
bandwidth/latency ratio than Subtopology methods. Similar results have been found
for all topology sizes from 9 to 64K processors, which indicates the good scalability of
the methods. A complete presentation of these results is found in [Gar97].


6.  DRB  MILE  Policy  Evaluation 

This section shows the results for different traffic patterns and network loads with
a fixed message length and compares the performance with that of static routing.

The simulations consisted of sending packets through the network links according
to a specific traffic pattern. The simulations were conducted for an 8x8 torus with
bidirectional links. We have assumed wormhole flow control and 10 flits per packet.
Each link was designed to have only one flit buffer associated with it. The packet
generation rate followed an exponential distribution whose average was the message
interarrival time. The simulations were run many times with different seeds and the
results were observed to be consistent. Each simulation was carried out for 100,000
packets. The effects of the first 20,000 delivered packets are not included in the
results in order to lessen the transient effects in the simulations.
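
The traffic generation and warm-up handling described above can be sketched as follows. This is not the simulator used for the experiments; the function names, the seed and the 30-cycle mean interarrival time are illustrative assumptions.

```python
import random

def injection_times(mean_interarrival, count, seed):
    """Packet injection times with exponentially distributed interarrival gaps."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        now += rng.expovariate(1.0 / mean_interarrival)  # rate = 1 / mean gap
        times.append(now)
    return times

def steady_state(samples, warmup):
    """Drop the first `warmup` samples to lessen transient effects."""
    return samples[warmup:]

times = injection_times(mean_interarrival=30.0, count=100_000, seed=1)
measured = steady_state(times, warmup=20_000)
```

Repeating the run with different values of `seed` corresponds to the consistency check across seeds mentioned in the text.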

We have chosen some of the communication patterns commonly used to evaluate
interconnection networks [DYN97]: uniform, hot-spot, matrix transposition and
butterfly.

We  have  studied  the  average  communication  latency,  the  average  throughput  of 
the  network,  and  the  traffic  load  distribution  in  the  network.  The  communication 
latency  was  measured  as  the  total  time  the  packets  have  to  wait  to  access  the  link 
from  source  to  destination.  The  throughput  was  calculated  as  the  percentage  relation 
between  the  accepted  and  the  applied  communication  loads  measured  as  number  of 
messages  per  unit  time.  In  order  to  show  the  traffic  load  distribution,  we  measure  the 
average  latency  in  each  link  of  the  network.  The  experiments  were  conducted  for  a 
range  of  communication  traffic  load  from  low  load  to  saturation. 

Results 

Under uniform traffic, there is no load imbalance and, therefore, DRB
routing does not modify the load distribution of the network, resulting in almost
the same average latency and average throughput for all ranges of
load. This is the behaviour expected according to DRB’s definition, which cannot
improve this situation. We do not show the figures for this case. For the other three
traffic patterns the behaviour is very different.

Fig. 2 shows the latency results for the hot-spot, bit-reversal and butterfly
traffic patterns. DRB routing demonstrates better performance. It can be seen that at
low load rates, DRB behaves almost identically to static routing. This means that the
DRB method does not load the network when it is not necessary. As load
increases, latency improvements increase too, resulting in latency reductions
greater than 50% at the highest load.

At the same time as these latency improvements are achieved, the throughput is
increased, as can be seen in Fig. 2 for all traffic patterns. The throughput is improved
by up to 50%. The conclusions are that more messages are sent with less latency
and that the network saturation point is reached at higher load rates, because DRB
routing maintains a uniform load distribution, making better use of network resources
for all tested traffic patterns.


Fig. 2. Performance results for DRB routing: average latency, average throughput
(panel shown: Latency, Torus 4x4; X axis: Message Generation Interval in cycles).


In order to show how DRB Routing distributes load and eliminates hot-spots, Fig.
3 shows the latency surface for network links for the hot-spot traffic pattern at a load
rate of 30 cycles as message generation interval. Each grid point represents the
average latency of the node links of the torus. It can be seen that, using Static
Routing, big hot-spots appear in the network while other regions of the network are
only slightly used. The maximum average latency in the hot-spots is around 15
cycles. When using DRB Routing, these hot-spots are effectively eliminated because
the load excess of the hot-spot nodes is distributed among other links. The maximum
average latency in this case is about 3.5 cycles.

Fig. 3. Latency distribution for the hot-spot pattern


7.  Conclusions 

Distributed Routing Balancing is a new method for message traffic distribution in
interconnection networks. DRB has been developed to try to fulfil the design
objectives for parallel computer interconnection networks. These objectives are
all-to-all connection and low and uniform latency between any pair of nodes and under any
message traffic load. Traffic distribution is achieved by defining Supernodes which
first send messages to an intermediate destination before sending them to their final
destination. Two Supernodes are defined: the first one is centred at the source node
and the second at the destination node. Only one or both kinds can be used, resulting
in one or two intermediate destinations for each source-destination pair.

DRB has two components. The first component is Supernode definition and the
second is Metapath selection.

Supernode definition has been explained and its parameters (latency/bandwidth)
characterised experimentally for its subsequent selection in the adaptive phase. The
new type of Supernode, Gravity Area, turns out to be more interesting than that
defined by topological analogy, because it maximises link usage, increasing the
output width from source/destination, not only along the message path. Therefore, a
methodology for Supernode definition has been created for each topology. DRB
offers a set of alternative paths to choose from, depending on the trade-offs between
throughput and latency.

The second component of DRB is the policy used to select a specific Supernode. The
dynamic policy we present monitors traffic load and dynamically configures
Supernode parameters depending on the temporary requirements of message load in
the network. The policy does not waste significant computation or communication
resources because it is fully distributed, and the monitoring and decision
overhead depends linearly on the number of messages in the network. DRB is
useful for persistent communication patterns, which are the ones that can cause the
worst hotspot situations. The evaluation done to validate DRB shows very good
improvements in latency, effectively eliminating hotspots from the network.

Abstract. The objective of this work is to obtain the dominant λ-modes
of a nuclear power reactor. This is a real generalized eigenvalue problem,
which can be reduced to a standard one. The method used to solve it
has been the Implicitly Restarted Arnoldi (IRA) method. Due to the
dimensions of the matrices, a parallel approach has been proposed,
implemented and ported to different platforms. This includes the
development of a parallel iterative linear system solver. To obtain the best
performance, care must be taken to exploit the structure of the matrices.

Keywords. Parallel computing, eigenproblems, lambda modes


1  Introduction 

The generalized algebraic eigenvalue problem is a standard problem that
frequently arises in many fields of science and engineering. In particular, it appears
in approximations of eigenproblems of differential operators. The application
presented here is taken from nuclear engineering.

The analysis of the lambda modes is of great interest for reactor safety and
modal analysis of neutron dynamical processes. In order to study the steady
state neutron flux distribution inside a nuclear power reactor and the sub-critical
modes responsible for the regional instabilities produced in the reactors, it is
necessary to obtain the dominant λ-modes and their corresponding eigenfunctions.

The discretization of the problem leads to an algebraic eigensystem which
can reach considerable sizes in real cases. The main aim of this work is to solve
this problem by using appropriate numerical methods and introducing High
Performance Computing techniques so that response time can be reduced to
the minimum. In addition, other benefits can be achieved. For
example, larger problems can be tackled and better precision in the results can
be attained.

This contribution is organized as follows. Section 2 presents
how the algebraic generalized eigenvalue problem is derived from the neutron
diffusion equation. Section 3 gives a short description of the matrices which arise
in this problem. Section 4 reviews the method used for the solution of
the eigenproblem. Section 5 describes some implementation issues, whereas in
section 6 the results are summarized. Finally, the main conclusions are presented
in section 7.

2  The  Neutron  Diffusion  Equation 

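Reactor calculations are usually based on the multigroup neutron diffusion
equation [12]. If this equation is modeled with two energy groups, then the problem
we have to deal with is to find the eigenvalues and eigenfunctions of

$$\mathcal{L}\phi_i = \frac{1}{\lambda_i}\mathcal{M}\phi_i , \qquad (1)$$

where

$$\mathcal{L} = \begin{bmatrix} -\nabla(D_1\nabla) + \Sigma_{a1} + \Sigma_{12} & 0 \\ -\Sigma_{12} & -\nabla(D_2\nabla) + \Sigma_{a2} \end{bmatrix}, \quad
\mathcal{M} = \begin{bmatrix} \nu_1\Sigma_{f1} & \nu_2\Sigma_{f2} \\ 0 & 0 \end{bmatrix}
\quad\text{and}\quad
\phi_i = \begin{bmatrix} \phi_{fi} \\ \phi_{ti} \end{bmatrix},$$

with the boundary conditions $\phi_i|_\Gamma = 0$, where $\Gamma$ is the reactor border.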

For a numerical treatment, this equation must be discretized in space. Nodal
methods are extensively used in this case. These methods are based on
approximations of the solution in each node in terms of an adequate base of functions,
for example, Legendre polynomials [9]. It is assumed that the nuclear properties
are constant in every cell. Finally, appropriate continuity conditions for fluxes
and currents are imposed.

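This process allows the original system of partial differential
equations to be transformed into a large sparse algebraic generalized eigenvalue problem

$$L\psi_i = \frac{1}{\lambda_i} M\psi_i , \qquad (2)$$

where $L$ and $M$ are matrices of order $2N$ with the following $N$-dimensional block
structure

$$\begin{bmatrix} L_{11} & 0 \\ -L_{21} & L_{22} \end{bmatrix}
\begin{bmatrix} \psi_{1i} \\ \psi_{2i} \end{bmatrix}
= \frac{1}{\lambda_i}
\begin{bmatrix} M_{11} & M_{12} \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \psi_{1i} \\ \psi_{2i} \end{bmatrix},$$

$L_{11}$ and $L_{22}$ being nonsingular sparse symmetric matrices, and $M_{11}$, $M_{12}$ and $L_{21}$
diagonal matrices. By eliminating $\psi_{2i}$, we obtain the following $N$-dimensional
non-symmetric standard eigenproblem

$$A\psi_{1i} = \lambda_i\psi_{1i} ,$$

where the matrix $A$ is given by

$$A = L_{11}^{-1}\left(M_{11} + M_{12}L_{22}^{-1}L_{21}\right) . \qquad (3)$$

All the eigenvalues of this equation are real. We are only interested in
calculating a few dominant ones with the corresponding eigenvectors.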
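
The elimination step can be written out explicitly. The following is a sketch assuming the 2 x 2 block structure with blocks $L_{11}$, $-L_{21}$, $L_{22}$ and $M_{11}$, $M_{12}$, and an eigenvector partitioned as $(\psi_{1i}, \psi_{2i})$; the sign convention of the $L_{21}$ block is our assumption, chosen so that the result agrees with equation (3).

```latex
\begin{align*}
L_{11}\psi_{1i} &= \frac{1}{\lambda_i}\left(M_{11}\psi_{1i} + M_{12}\psi_{2i}\right), \\
-L_{21}\psi_{1i} + L_{22}\psi_{2i} &= 0
  \;\Longrightarrow\; \psi_{2i} = L_{22}^{-1}L_{21}\psi_{1i}, \\
L_{11}\psi_{1i} &= \frac{1}{\lambda_i}\left(M_{11} + M_{12}L_{22}^{-1}L_{21}\right)\psi_{1i}, \\
\underbrace{L_{11}^{-1}\left(M_{11} + M_{12}L_{22}^{-1}L_{21}\right)}_{A}\,\psi_{1i}
  &= \lambda_i\,\psi_{1i}.
\end{align*}
```

Note that $A$ never needs to be formed explicitly: a product with $A$ requires one solve with $L_{22}$, products with the diagonal blocks, and one solve with $L_{11}$.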

This problem has been solved with numerical methods such as Subspace
Iteration [10], [11]. Here we present a more effective strategy.

3  Matrix Features


In  order  to  validate  the  correctness  of  the  implemented  programs,  two  reactors 
have  been  chosen  as  test  cases. 

The first benchmark is the BIBLIS reactor [3], which is a pressurized water
reactor (PWR). Due to its characteristics, this reactor has been modeled in a
bidimensional fashion with 1/4 symmetry. The nodalization scheme is shown in
figure 1(a). The darkest cells represent the 23.1226 cm wide reflector whereas
the other cells correspond to the reactor kernel, with a distance between nodes
of 23.1226 cm as well. The kernel cells can be of up to 7 different materials.


Fig. 1. The BIBLIS benchmark: (a) Nodalization.


The numbers of the nodes follow a left-right top-down ordering, as shown
in the figure. This leads to a staircase-like matrix pattern which can be seen
in figure 1(b). Note that only the upper triangular part is stored. This pattern
is identical for both $L_{11}$ and $L_{22}$. In particular, the matrix pattern depicted in
the figure corresponds to a space discretization using Legendre polynomials of
5th degree. This degree is directly related to the number of rows and columns
associated with every node in the mesh. The dimensions of the matrices are shown
in table 1.

The  other  reference  case  is  the  RINGHALS  reactor  [4],  a  real  boiling  water 
reactor  (BWR).  This  reactor  has  been  discretized  three-dimensionally  in  27  axial 
planes  (25  for  the  fuel  and  2  for  the  reflector).  In  its  turn,  each  axial  plane  is 
divided  in  15.275  cm  x  15.275  cm  cells  distributed  as  shown  in  figure  2(a).  In 
this  case,  it  is  not  possible  to  simplify  the  problem  because  the  reactor  does  not 
have  any  symmetry  by  planes.  Each  of  the  15600  cells  has  different  neutronic 
properties.  As  expected,  matrices  arising  from  this  nodalization  scheme  are  much 
larger  and  have  a  much  more  regular  structure  (figure  2(b)). 

Apart from symmetry and sparsity, the most remarkable feature of the
matrices is bandedness. This aspect has been exploited in the implementation.



FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


Fig. 2. The RINGHALS benchmark: (a) Nodalization.


In table 1 several properties of the matrices are listed to give an idea of
the magnitude of the problem. In this table, dpol is the degree of the Legendre
polynomials, n is the dimension of the matrix, nz is the number of non-zero
elements stored in the upper triangle, bw is the upper bandwidth, and disp is the
proportion of non-zero values with respect to the whole matrix. Finally, the
storage requirements for the values are given (mem) to emphasize this issue.


Reactor    dpol        n        nz      mem      bw      disp

Biblis        1       73       201   1.6 Kb      10     0.062
              2      219      1005   7.9 Kb      29     0.037
              3      438      2814    22 Kb      57     0.027
              4      730      6030    47 Kb      94     0.021
              5     1095     11055    86 Kb     140     0.017

Ringhals      1    20844     80849   0.6 Mb     773   0.00032
              2    83376    505938   3.9 Mb    3090   0.00013
              3   208440   1721200    13 Mb    7723  0.000074
              4   416880   4355110    33 Mb   15444  0.000048
              5   729540   9218685    70 Mb   27025  0.000033

Table 1. Several properties of the matrices.
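
The mem and disp columns can be cross-checked against n and nz with a short calculation. The sketch below rests on assumptions that are ours and not stated in the text: values are stored as 8-byte reals, Kb and Mb are powers of 1024, every diagonal entry is non-zero and stored, and disp counts the mirrored lower triangle as well.

```python
def upper_triangle_storage_bytes(nz, bytes_per_value=8):
    """Memory for the stored values only (index arrays are not counted)."""
    return nz * bytes_per_value

def full_matrix_density(n, nz_upper):
    """Fraction of non-zeros in the full symmetric matrix.

    The upper triangle holds the n diagonal entries plus (nz_upper - n)
    off-diagonal entries, each of which is mirrored in the lower triangle.
    """
    return (2 * nz_upper - n) / (n * n)

biblis_kb = upper_triangle_storage_bytes(201) / 1024            # Biblis, dpol = 1
biblis_density = full_matrix_density(73, 201)
ringhals_mb = upper_triangle_storage_bytes(9218685) / 1024**2   # Ringhals, dpol = 5
```

Under these assumptions the first Biblis row gives about 1.6 Kb and a density of about 0.062, and the last Ringhals row about 70 Mb, in line with the table; this also suggests that disp is tabulated as a fraction.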


4  Implicitly Restarted Arnoldi Method

As  we  only  want  to  determine  the  dominant  eigenvalues  which  define  the  reactor 
behaviour,  we  approach  this  partial  eigenproblem  with  the  Arnoldi  method. 

The Arnoldi method is a Krylov subspace or orthogonal projection method
for extracting spectral information. We call

$$\mathcal{K}_k(A, v_0) = \mathrm{span}\{v_0, Av_0, A^2v_0, \ldots, A^{k-1}v_0\}$$

the $k$-th Krylov subspace corresponding to $A \in \mathbb{C}^{n\times n}$ and $v_0 \in \mathbb{C}^n$. The idea is
to construct approximate eigenvectors in this subspace.

We define a $k$-step Arnoldi factorization of $A$ as a relationship of the form

$$AV = VH + fe_k^T ,$$

where $V \in \mathbb{C}^{n\times k}$ has orthonormal columns, $V^Hf = 0$, and $H \in \mathbb{C}^{k\times k}$ is upper
Hessenberg with a non-negative sub-diagonal. The central idea behind this
factorization is to construct eigenpairs of the large matrix $A$ from the eigenpairs of
the small matrix $H$.
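
A $k$-step Arnoldi factorization can be built with a few lines of code. The following is a minimal sketch in real arithmetic using modified Gram-Schmidt and plain Python lists; it is not the parallel implementation described in this paper, and the function names and the small test matrix are illustrative assumptions.

```python
import math

def matvec(a, x):
    """Dense matrix-vector product; `a` is a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def arnoldi(a, v0, k):
    """k-step Arnoldi factorization A V = V H + f e_k^T.

    Returns (basis, h, f): `basis` lists the k orthonormal columns of V,
    `h` is the k x k upper Hessenberg matrix H, `f` is the residual vector.
    """
    norm = math.sqrt(dot(v0, v0))
    basis = [[x / norm for x in v0]]
    h = [[0.0] * k for _ in range(k)]
    for j in range(k):
        w = matvec(a, basis[j])
        # Modified Gram-Schmidt against all basis vectors built so far.
        for i in range(j + 1):
            h[i][j] = dot(basis[i], w)
            w = [wi - h[i][j] * vi for wi, vi in zip(w, basis[i])]
        if j == k - 1:
            return basis, h, w            # what is left of w is the residual f
        beta = math.sqrt(dot(w, w))
        if beta == 0.0:
            raise ValueError("Krylov subspace became invariant before k steps")
        h[j + 1][j] = beta                # non-negative sub-diagonal entry
        basis.append([wi / beta for wi in w])

a = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 5.0]]
basis, h, f = arnoldi(a, [1.0, 0.0, 0.0, 0.0], 3)
```

The eigenvalues of the small matrix `h` (Ritz values) would then serve as approximations to eigenvalues of `a`.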

In general, we would like the starting vector $v_0$ to be rich in the directions of
the desired eigenvectors. In some sense, as we get a better idea of what the desired
eigenvectors are, we would like to adaptively refine $v_0$ to be a linear
combination of the approximate eigenvectors and restart the Arnoldi factorization with
this new vector instead. A convenient and stable way to do this without
explicitly computing a new Arnoldi factorization is given by the Implicitly Restarted
Arnoldi (IRA) method, based on the implicitly shifted QR factorization [8].

The idea of the IRA method is to extend a $k$-step Arnoldi factorization

$A V_k = V_k H_k + f_k e_k^T$

to a $(k + p)$-step Arnoldi factorization

$A V_{k+p} = V_{k+p} H_{k+p} + f_{k+p} e_{k+p}^T$.

Then $p$ implicit shifts are applied to the factorization, resulting in the new factorization

$A V_+ = V_+ H_+ + f_{k+p} e_{k+p}^T Q$,

where $V_+ = V_{k+p} Q$, $H_+ = Q^H H_{k+p} Q$, and $Q = Q_1 Q_2 \cdots Q_p$, where $Q_i$ is associated with factoring $(H - \sigma_i I) = Q_i R_i$. It turns out that the first $k - 1$ entries of $e_{k+p}^T Q$ are zero, so that a new $k$-step Arnoldi factorization can be obtained by equating the first $k$ columns on each side:

$A V_k^+ = V_k^+ H_k^+ + f_k^+ e_k^T$.

We can iterate the process of extending this new $k$-step factorization to a $(k + p)$-step factorization, applying shifts, and condensing. The payoff is that every iteration implicitly applies a polynomial of degree $p$ in $A$ to the initial vector $v_0$. The roots of the polynomial are the $p$ shifts that were applied to the factorization. Therefore, if we choose as the shifts $\sigma_j$ eigenvalues that are “unwanted”, we can effectively filter the starting vector $v_0$ so that it is rich in the direction of the “wanted” eigenvectors.

5  Implementation 

This section describes some details of the implemented codes for a distributed memory environment. The Message Passing Interface (MPI) [2] has been used as the message passing layer so that portability is guaranteed.

5.1 Eigensolver Iteration

The Implicitly Restarted Arnoldi (IRA) method used in this work is the one implemented in the ARPACK [5] software package. This package contains a suite of codes for the solution of several types of eigenvalue-related problems, including standard and generalized eigenproblems and singular value decompositions for both real and complex matrices. It implements the IRA method for non-symmetric matrices and the analogous Lanczos method for symmetric matrices.

In  particular,  the  programs  implemented  for  the  calculation  of  the  lambda 
modes  make  use  of  the  parallel  version  of  ARPACK  [6],  which  is  oriented  to  a 
SPMD/MIMD  programming  paradigm.  This  package  uses  a  distribution  of  all 
the  vectors  involved  in  the  algorithms  by  blocks  among  the  available  processors. 
The  size  of  the  block  can  be  established  by  the  user,  thus  allowing  more  flexibility 
for  load  balancing. 

As in many other packages of iterative methods, the ARPACK subroutines are organized so that they offer a reverse communication interface to the user [1]. The primary aim of this scheme is to isolate the matrix-vector operations. Whenever the iterative method needs the result of an operation such as a matrix-vector product, it returns control to the user's subroutine that called it. After performing this operation, the user invokes the iterative method subroutine again.
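
The reverse communication idea can be sketched with a Python generator. This is a hedged illustration of the control flow only and not the ARPACK calling sequence: a plain power iteration stands in for the eigensolver, and the names `power_iteration_rc` and `matvec` are hypothetical.

```python
def power_iteration_rc(n, iters):
    """Reverse-communication power iteration: yields x and expects A x back."""
    x = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        y = yield x   # hand control back to the caller, who computes y = A x
        # Rayleigh quotient with the vector that was multiplied.
        lam = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return lam

def matvec(x):
    # The operator is known only through its action; here it is diag(1, 2, 3).
    return [1.0 * x[0], 2.0 * x[1], 3.0 * x[2]]

solver = power_iteration_rc(3, 60)
x = next(solver)                        # the solver asks for the first product
try:
    while True:
        x = solver.send(matvec(x))      # supply the product, receive the next request
except StopIteration as stop:
    dominant = stop.value               # approximation to the dominant eigenvalue
```

The solver never sees the matrix: the caller is free to store it in any format, or not to store it at all, which is the flexibility discussed in this section.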

The  flexibility  of  this  scheme  gives  the  possibility  of  using  various  matrix 
storage  formats  as  well  as  obtaining  the  eigenvalues  and  eigenvectors  of  a  matrix 
for  which  an  explicit  form  is  not  available.  Indeed,  the  problem  we  are  presenting 
here  is  a  non-symmetric  standard  partial  eigenproblem  of  an  operator  given  by 
the  expression  (3),  where  A  is  not  calculated  explicitly. 

The explicit construction of the inverses would imply the loss of sparsity properties, thus making the storage needs prohibitive. For this reason, the matrix-vector product needed in the Arnoldi process has to be calculated by performing the operations which appear in (3) one at a time. The necessary steps to compute $y = Ax$ are the following:

1. Calculate $w_1 = M_{11} x$.

2. Calculate $w_2 = L_{21} x$.

3. Solve the system $L_{22} w_3 = w_2$ for $w_3$.

4. Calculate $w_4 = w_1 + M_{12} w_3$.

5. Solve the system $L_{11} y = w_4$ for $y$.

It  has  to  be  noted  that  the  above  matrix-vector  products  involve  only  diagonal 
matrices.  Therefore,  the  most  costly  operations  are  the  solution  of  linear  systems 
of  equations  (steps  3  and  5). 
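
The five steps can be sketched as a matrix-free operator. The sketch assumes the steps read $w_1 = M_{11} x$, $w_2 = L_{21} x$, $L_{22} w_3 = w_2$, $w_4 = w_1 + M_{12} w_3$ and $L_{11} y = w_4$; the small random matrices are hypothetical stand-ins for the reactor matrices, and a dense direct solve replaces the iterative solver purely for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

def spd(size):
    # Hypothetical stand-in for L11 and L22: a random symmetric positive definite matrix.
    B = rng.standard_normal((size, size))
    return B @ B.T + size * np.eye(size)

L11, L22 = spd(n), spd(n)
M11 = np.diag(rng.uniform(0.5, 1.5, n))   # diagonal matrices, as noted in the text
M12 = np.diag(rng.uniform(0.5, 1.5, n))
L21 = np.diag(rng.uniform(0.5, 1.5, n))

def apply_A(x):
    """Compute y = A x without ever forming A."""
    w1 = M11 @ x                       # step 1
    w2 = L21 @ x                       # step 2
    w3 = np.linalg.solve(L22, w2)      # step 3 (an iterative solver in the paper)
    w4 = w1 + M12 @ w3                 # step 4
    return np.linalg.solve(L11, w4)    # step 5

x = rng.standard_normal(n)
# Explicit operator implied by the five steps, formed here only to check the result.
A_explicit = np.linalg.inv(L11) @ (M11 + M12 @ np.linalg.inv(L22) @ L21)
difference = np.linalg.norm(apply_A(x) - A_explicit @ x)
```

For the real problem the explicit matrix would be dense and is never built; `A_explicit` appears here only because the example is tiny.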

5.2  Linear  Systems  of  Equations 

The solution of the linear systems can be approached with iterative methods, such as the Conjugate Gradient [7]. These methods, in turn, typically use the aforementioned reverse communication scheme.

For the parallel implementation, two basic operations have to be provided, namely the matrix-vector product and the dot product.

In this case, the matrix-vector product subroutine deals with sparse symmetric matrices ($L_{ii}$). The parallel implementation of this operation is described later. For the distributed dot product function, each processor can perform a dot product on the sub-vectors it has, and then perform a summation of all the partial results.
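
The distributed dot product can be sketched as follows. The processors are only simulated inside one Python process; the helper names `split_blocks` and `distributed_dot` are hypothetical, and in a message-passing code the final sum would be a global reduction.

```python
def split_blocks(v, p):
    """Split v into p contiguous blocks whose sizes differ by at most one."""
    q, r = divmod(len(v), p)
    blocks, start = [], 0
    for rank in range(p):
        size = q + (1 if rank < r else 0)
        blocks.append(v[start:start + size])
        start += size
    return blocks

def distributed_dot(x, y, p):
    # Each simulated processor computes a partial dot product on its own sub-vectors.
    partials = [sum(a * b for a, b in zip(xb, yb))
                for xb, yb in zip(split_blocks(x, p), split_blocks(y, p))]
    # Summation of all the partial results.
    return sum(partials)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 4.0, 3.0, 2.0, 1.0]
result = distributed_dot(x, y, p=3)
```

With three simulated processors the blocks have sizes 2, 2 and 1, and the partial results 13, 17 and 5 add up to the global value 35.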

In order to accelerate the convergence, a Jacobi preconditioning scheme was used because of its good results and also because its parallelization is straightforward.
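
A minimal sketch of the Conjugate Gradient method with Jacobi preconditioning is given below. It is a textbook formulation written with NumPy, not the code used in the paper; the function name `jacobi_pcg`, the tolerance and the tridiagonal test matrix are assumptions of the example.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate Gradient with Jacobi (diagonal) preconditioning for SPD A."""
    inv_diag = 1.0 / np.diag(A)      # applying the preconditioner is an element-wise product
    x = np.zeros_like(b)
    r = b - A @ x
    z = inv_diag * r
    p = z.copy()
    rz = r @ z
    for it in range(max_iter):
        if np.linalg.norm(r) < tol:
            return x, it
        Ap = A @ p                   # the only matrix-vector product per iteration
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Hypothetical symmetric, strictly diagonally dominant (hence SPD) tridiagonal matrix.
n = 20
A = (np.diag(np.linspace(2.0, 40.0, n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))
b = np.ones(n)
x, iterations = jacobi_pcg(A, b)
```

Since the preconditioner acts element by element, each processor can apply it to its own block of the residual without any communication, which is why its parallelization is straightforward.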


5.3 Parallel Matrix-Vector Product

The matrices involved in the systems of equations ($L_{ii}$) are symmetric, sparse and with their nonzero elements within a narrow band. This structure must be exploited for an optimal result. The storage scheme used has been Compressed Sparse Row containing only the upper triangle elements including the main diagonal.

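A multiplication algorithm which exploits the symmetry can view the product $L_{ii} x = y$ as $(U + (L_{ii} - U)) x = y$, where $U$ is the stored part. Thus, the product can be written as the addition of two partial products,

$\underbrace{U x}_{y_1} + \underbrace{(L_{ii} - U) x}_{y_2} = y$.

The algorithm for the first product can be row-oriented (each entry of $y_1$ is the product of one stored row of $U$ with $x$), whereas the other must be column-oriented ($y_2 = 0$, then $y_2 = y_2 + x_i (L_{ii} - U)_i$ for every column $i$), because column $i$ of $L_{ii} - U$ is the transpose of the strictly upper part of row $i$ of $U$.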
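
The two partial products can be fused into a single sweep over the stored upper triangle, as the following sketch shows. It is a sequential illustration in plain Python; the function name `sym_csr_upper_matvec` and the 3 x 3 example are assumptions, while the three arrays follow the usual Compressed Sparse Row layout.

```python
def sym_csr_upper_matvec(n, row_ptr, col_idx, values, x):
    """y = L x for a symmetric L stored as its upper triangle (diagonal included) in CSR."""
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j, a = col_idx[k], values[k]
            y[i] += a * x[j]        # row-oriented part: y1 = U x
            if j != i:
                y[j] += a * x[i]    # column-oriented part: y2 = (L - U) x
    return y

# L = [[4, 1, 0],
#      [1, 5, 2],
#      [0, 2, 6]]   stored as its upper triangle only
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 1, 2, 2]
values = [4.0, 1.0, 5.0, 2.0, 6.0]
y = sym_csr_upper_matvec(3, row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each stored off-diagonal entry is used twice, once for its own row and once for the mirrored position, so only about half of the matrix has to be kept in memory.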

The parallel matrix-vector product has to be highly optimized in order to achieve good overall performance, since it is the most time-consuming operation. The matrices have been partitioned by blocks of rows conforming with the partitioning of the vectors. It has been implemented in three stages: one initial communication stage, parallel computation of intermediate results and finally another communication operation. This last stage is needed because only the upper triangular part of the symmetric matrices is stored. In the communication stages, only the minimum amount of information is exchanged and this is done synchronously by all the processors.

Figure 3 shows a scheme of the parallel matrix-vector product for the case of 4 processors. The shadowed part of the matrix corresponds to the elements stored in processor 1. Before one processor can carry out its partial product, a fragment of the vector operand owned by the neighbour processor is needed. Similarly, after the partial product is complete, part of the solution vector must be exchanged.

Fig. 3. Message passing scheme for the matrix-vector product.

The detailed sequence of operations in processor $i$ is the following:

1. Receive from processor $i + 1$ the necessary components of $x$. Also send to processor $i - 1$ the corresponding ones.

2. Compute the partial result of $y$.

3. Send to processor $i + 1$ the fragment of $y$ assigned to it. Get from processor $i - 1$ the part which must be stored in the local processor, as well.

4. Add the received block of $y$ to the block calculated locally, in the appropriate positions.

The communication steps 1 and 3 are carried out synchronously by all the processors. The corresponding MPI primitives have been used in order to minimize problems such as network access contention.
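
The four-step sequence can be simulated inside a single Python process, as sketched below. No message passing library is used: the "receive" and "send" operations are ordinary dictionary copies, the function name `simulate_parallel_matvec` and the pentadiagonal test matrix are hypothetical, and the sketch assumes that the bandwidth does not exceed the size of the neighbouring block.

```python
def simulate_parallel_matvec(rows, x, bounds):
    """rows[i] lists (j, a) with j >= i; bounds[p] = (lo, hi) are the rows of processor p."""
    y_local, outgoing = [], []
    for lo, hi in bounds:
        # Step 1: obtain from the next processor the entries of x beyond the local block.
        reach = max((j for i in range(lo, hi) for j, _ in rows[i]), default=hi - 1)
        x_seen = {j: x[j] for j in range(lo, reach + 1)}
        # Step 2: partial result, split into the locally owned part and the part
        # that belongs to rows stored on the next processor.
        y_own = {i: 0.0 for i in range(lo, hi)}
        y_out = {}
        for i in range(lo, hi):
            for j, a in rows[i]:
                y_own[i] += a * x_seen[j]
                if j != i:
                    target = y_own if j < hi else y_out
                    target[j] = target.get(j, 0.0) + a * x_seen[i]
        y_local.append(y_own)
        outgoing.append(y_out)
    # Steps 3 and 4: pass the fragment to the next processor, which adds it in place.
    for p in range(len(bounds) - 1):
        nxt_lo, nxt_hi = bounds[p + 1]
        for j, v in outgoing[p].items():
            assert nxt_lo <= j < nxt_hi, "bandwidth exceeds the neighbouring block"
            y_local[p + 1][j] += v
    return [y_local[p][i] for p, (lo, hi) in enumerate(bounds) for i in range(lo, hi)]

# Symmetric pentadiagonal matrix of order 6: 4 on the diagonal, 1 and 0.5 on the bands.
n = 6
rows = [[(j, {0: 4.0, 1: 1.0, 2: 0.5}[j - i]) for j in range(i, min(i + 3, n))]
        for i in range(n)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = simulate_parallel_matvec(rows, x, bounds=[(0, 2), (2, 4), (4, 6)])
```

Only the entries of $x$ and $y$ that fall inside the band beyond a block boundary are exchanged, which matches the remark that the minimum amount of information is communicated.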

6  Results 

The  platforms  on  which  the  code  has  been  tested  are  a  Sun  Ultra  Enterprise 
4000  multiprocessor  and  a  cluster  of  Pentium  II  processors  at  300  MHz  with  128 
Mb  of  memory  each  connected  with  a  Fast  Ethernet  network.  The  code  has  been 
ported  to  several  other  platforms  as  well. 

Several experiments have shown that the most appropriate method for the solution of the linear systems in this particular problem is the Conjugate Gradient (CG) with Jacobi preconditioning. Other iterative methods have been tested, including BCG, BiCGStab, TFQMR and GMRES. In all the cases, the performance of these solvers has turned out to be worse than that of CG. Table 2 compares the average number of $L_{11} x_k$ products and the time (in seconds) spent by the programs in the solution of the Ringhals benchmark (dpol = 1) for each of the tested methods. Apart from the response time, it can also be observed in this table that CG is the solver which requires the least memory. The time corresponding to eight processors is also included to show that the efficiency ($E_p$) varies only slightly.

Method      L11 x_k   Time (p=1)   Time (p=8)   E_p (%)   Memory
CG               12       102.04        15.12        84   bn
BCG              22       174.32        24.84        88   In
BiCGStab         15       129.99        19.03        85   %n
TFQMR            13       155.59        22.75        86   lln
GMRES(3)         16       210.79        30.49        86   ~ 5n

Table 2. Comparison between several iterative methods.
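
The efficiency column of Table 2 can be re-derived from the two time columns. The sketch below assumes the usual definitions, speed-up $S_p = T_1 / T_p$ and efficiency $E_p = S_p / p$, which this section does not state explicitly; the times are copied from the table.

```python
# method: (time with p = 1, time with p = 8), in seconds, copied from Table 2
times = {
    "CG": (102.04, 15.12),
    "BCG": (174.32, 24.84),
    "BiCGStab": (129.99, 19.03),
    "TFQMR": (155.59, 22.75),
    "GMRES(3)": (210.79, 30.49),
}
p = 8
efficiency_pct = {}
for method, (t1, tp) in times.items():
    speedup = t1 / tp                               # S_p = T_1 / T_p
    efficiency_pct[method] = 100.0 * speedup / p    # E_p = S_p / p, in percent
    print(f"{method:9s} speed-up {speedup:5.2f}  efficiency {efficiency_pct[method]:5.1f} %")
```

Under these assumed definitions the recomputed values agree with the tabulated efficiencies to within about one percentage point, with the small differences attributable to rounding of the reported times.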


6.1  Adjustment  of  Parameters 

When applied to this particular problem, the IRA method is a convergent process in all the studied cases. However, it is not an approximate method in the sense that the precision of the obtained approximate solution depends to a great extent on the tolerance demanded in the iterative process for the solution of linear systems of equations. Table 3 reflects the influence of the precision required in the Conjugate Gradient process ($\|L_{11} x - b\|_2 < tol_{CG}$) on values such as the average number of matrix-vector products ($L_{11} x_k$), the number of Arnoldi iterations (IRA) and the precision of the obtained eigenvalue. It can be observed that the number of significant digits in the approximate solution $\lambda_1$ matches the required precision $tol_{CG}$ and that, after a certain threshold ($10^{-5}$), the greater the tolerance, the worse the approximate solution is.


$tol_{CG}$   $L_{11} x_k$   IRA    $Ax$     Time   $\lambda_1$   $\|Ax - \lambda x\| / |\lambda|$
$10^{-9}$              33    19      60   157.59      1.012189   0.000096
$10^{-8}$              29    19      60   139.33      1.012189   0.000096
$10^{-7}$              25    19      60   120.48      1.012189   0.000096
$10^{-6}$              21    19      60   102.16      1.012189   0.000097
$10^{-5}$              17    19      60    85.71      1.012189   0.000101
0.0001                 13    19      60    67.16      1.012170   0.000354
0.001                   9    23      72    57.47      1.011175   0.004809
0.01                    5    38     117    58.14      1.003579   0.033075
0.1                     2    14      44    12.04      0.801034   0.297213

Table 3. Influence of the CG tolerance in the precision of the final result.


In  conclusion,  the  tolerance  to  be  demanded  in  both  the  IRA  and  CG  methods 
should  be  of  the  same  magnitude.  In  our  case,  it  is  sufficient  to  choose  10“^  and 
10“''’,  respectively,  since  input  data  are  perturbed  by  inherent  errors  of  the  order 
of $10^{-6}$.

Another parameter to be considered is the number of columns of $V$ (ncv), that is, the maximum dimension of the Arnoldi basis before restart. This value has a great influence on the effectiveness of the IRA method, both in its computational cost and in its memory requirements. If ncv takes a large value, the computational cost per iteration and the storage requirements can be prohibitive. However, a small value can imply that the constructed Krylov subspace contains too little information and, consequently, that the process needs too many iterations until convergence is achieved. Some experiments reveal that, in this particular application, a value of 2 or 3 times the number of desired eigenvalues gives good results.

Figure 4 shows the number of matrix-vector operations ($A x_k$) necessary to achieve convergence for different values of ncv. The experiment has been repeated for a number of desired eigenvalues (nev) ranging from 1 to 6. In the case of only one eigenvalue, it is clear that a small subspace dimension makes the convergence much slower. On the other hand, increasing ncv beyond some value does not improve convergence while it does increase the memory requirements.


Fig.  4.  Influence  of  ncv  in  the  convergence  of  the  algorithm. 


In the other five cases, one can observe a similar behaviour. However, the lines are mixed, mainly for the next three eigenvalues. This is caused by the known fact that clusters of eigenvalues affect the convergence. Figure 5 shows the convergence history of the first 6 eigenvalues. In this plot it can be seen that eigenvalues 2, 3 and 4 converge at nearly the same time, because they are very close to each other (see Table 4).

6.2  Speed-up  and  Efficiency 

In Figure 6 we give the performance of the parallel eigensolver measured on the Sun multiprocessor. The graphics show the execution time (in seconds), speed-up and efficiency with up to eight processors for each of the five matrices corresponding to the Ringhals benchmark. The Biblis reactor has not been considered for measurement of parallel performance because of its small size.

Fig. 5. Convergence history of the first 6 dominant eigenvalues of the Ringhals case (dpol=2). [Plot: converged eigenvalues versus iterations, 0 to 180.]


              $\lambda_i$   $\|Ax - \lambda_i x\| / |\lambda_i|$
$\lambda_1$      1.012190   0.000038
$\lambda_2$      1.003791   0.000042
$\lambda_3$      1.003164   0.000043
$\lambda_4$      1.000775   0.000036
$\lambda_5$      0.995716   0.000049
$\lambda_6$      0.993402   0.000096

Table 4. Dominant eigenvalues of the Ringhals benchmark (dpol=2).


In these graphics, it can be seen that the speed-up increases almost linearly with the number of processors. The efficiency does not fall below 75% in any case. We can expect these scalability properties to hold with more processors.

The  discretization  with  polynomials  of  2nd  degree  gives  a  reasonably  accurate 
approximation  for  the  Ringhals  benchmark.  In  this  case,  the  response  time  with 
8  processors  is  about  70  seconds. 

In  the  case  of  the  cluster  of  Pentiums,  the  efficiency  reduces  considerably 
as  a  consequence  of  a  much  slower  communication  system.  However,  the  results 
are  quite  acceptable.  Figure  7  shows  a  plot  of  the  efficiency  attained  for  the 
five  cases  of  the  Ringhals  benchmark.  It  can  be  observed  that  for  the  biggest 
matrix,  the  efficiency  is  always  greater  than  50%  even  when  using  8  computers, 
corresponding  to  a  speed-up  of  4. 

Another advantage in the case of personal computers must be emphasized: the utilization of the memory resources. The most complex cases (Ringhals with dpol = 4 and dpol = 5) cannot be run on a single computer because of the memory requirements. Note that the memory size of the running process can be more than 250 Mb. The parallel approach gives the possibility to share the memory of several computers to solve a single problem.

Fig. 6. Graphics of performance of the eigensolver. Panel (c): efficiency.


Fig.  7.  Efficiency  obtained  in  a  cluster  of  PC’s. 


7  Conclusions 

In this work, we have developed a parallel implementation of the Implicitly Restarted Arnoldi method for the problem of obtaining the $\lambda$-modes of a nuclear reactor. This parallel implementation can reduce the response time significantly, especially in complex realistic cases. The nature of the problem has forced us to implement a parallel iterative linear system solver as a part of the solution process. With regard to scalability, the experiments with up to 8 processors have shown a good behaviour of the eigensolver.

Apart  from  the  gain  in  computing  time,  another  advantage  of  the  parallel 
implementation  is  the  possibility  to  cope  with  larger  problems  using  inexpensive 
platforms  with  limited  physical  memory  such  as  networks  of  personal  computers. 
With  this  approach,  the  total  demanded  memory  is  evenly  distributed  among 
all  the  available  processors. 

The following lines of work are being considered for further refinement of the programs:

- Store the solution vectors of the linear systems to use them as initial solution estimations in subsequent IRA iterations.

- Begin the Arnoldi process taking as initial estimated solution an extrapolation of the solution of the previous order problem.

- Use a more appropriate numbering scheme for the nodes of the grid in order to reduce the bandwidth of the matrices and, consequently, reduce the size of the messages exchanged between processors.

- Implement a parallel direct solver for the linear systems of equations, instead of using an iterative method. This should be combined with the reduction of bandwidth so that the fill-in is minimized.

- Consider the more general case of having G energy groups in the neutron diffusion equation, instead of only 2. In this case, it would probably be more effective to approach the problem as a generalized eigensystem.

- Consider also the general dynamic problem. In this case, the solution is time-dependent and a convenient way of updating it would be of interest.
Abstract. The Markov Decision Control Problem of optimal task distribution in a computer network is formulated. The open-loop and closed-loop control strategies based on the stochastic forecast are presented. Numerical tests for the mixed FE/FD scheme performed on the heterogeneous network exhibit the behavior of different strategies for various sizes of data to be processed.


Keywords: stochastic control, distributed computation, optimal task distribution, Markov Decision Process


1 Master-Slave Application and Heterogeneous Network Model

The heterogeneous computer network and distributed application is a tuple $\langle H, A, F, G_A \rangle$ [8], where:

• $H = \{M_0, \ldots, M_m\}$ is a set of $m + 1$ distinct machines, each of which may communicate with any other one and may have a different architecture (sequential, vector, parallel etc.); the performance parameters of each machine may vary in time. We assume that the computer network is nondecomposable [1].

• $A$ is a master-slave application, which consists of a sequential part $t_0$ (master task) and a distributed part of $N$ identical tasks (slaves) $V_D = \{t_1, \ldots, t_N\}$. Additionally, we assume that tasks are constrained to be atomic.

• $G_A = (V, E)$ is a scheduled task graph for the application $A$ [1]; $E \subset V \times V$, $V = V_D \cup \{t_0\}$, $(t_i, t_k) \in E$ iff $t_i$ must execute before $t_k$ due to data dependence or task synchronization requirements of $A$. Without loss of generality, we assume that $G_A$ has unique entry and exit nodes identified with the task $t_0$, and for any task $t_i$ there is a path from $t_0$ to $t_i$ and a path from $t_i$ to $t_0$.

• $F$ is a mapping $F : V \to H$ which defines the $G_A$ and which in the case of dynamic task allocation is defined as a sequence of partial mappings $\{F_n\}_{n=0,1,\ldots}$ such that

$\bigcup_n F_n^{-1}(M_i) = F^{-1}(M_i) \quad \forall M_i \in H, \qquad F^{-1}(M_0) = t_0$ (1)


2  Estimation  of  State  of  Background  Workload 

The considered network $H$ is composed of computers whose resources are shared among several users (multi-user systems). Therefore, processes may appear randomly on each machine $M \in H$, so the background load of $M$ and the background load of the communication network change in time, and their changes have a significant influence on the efficiency of large applications computed in the distributed network $H$. For that reason the background load of the machines and of the communication network is estimated in a stochastic way. The advantage of such a model is the opportunity to define the functional dependencies between system performance parameters and workload parameters, which gives us knowledge about the future evolution of those parameters. Only the stochastic model for the background load of the network $H$ is described in this paper, since the formal description of the communication network load is identical.

Let $load_j \in \mathbb{R}_+$ be the additional background load for the machine $M_j \in H$. Each value $load_j$ corresponds to the execution slowing-down parameter $\eta_j \in \mathbb{R}_+$ (defined in [7, 8]), i.e. $T_j = (1 + \eta_j) \cdot T_j'$, where $T_j$ and $T_j'$ are the pattern task execution times with and without additional background load, respectively. Further on, we will consider the bijection $\psi_j : \mathbb{R}_+ \ni load_j \mapsto \eta_j \in \mathbb{R}_+$.

Let us consider two ordered sets of thresholds, $\Theta_j$ and $\Gamma_j$:

$\Theta_j = \{\theta_j(i) \in \mathbb{R}_+,\ i = 0, \ldots, K_j,\ K_j \in \mathbb{N} : \forall i = 0, \ldots, K_j - 1 \quad \theta_j(i) < \theta_j(i + 1)\}$ (2)

$\Gamma_j = \{\eta_j(i) \in \mathbb{R}_+,\ i = 0, \ldots, K_j,\ K_j \in \mathbb{N} : \forall i = 0, \ldots, K_j - 1 \quad \eta_j(i) < \eta_j(i + 1)\}$ (3)

the first one for the physical measured performance parameters (like $load_j$) and the second one for the slowing-down parameters, and such that $\forall M_j \in H\ \forall i = 0, \ldots, K_j \quad \psi_j(\theta_j(i)) = \eta_j(i)$. So, $\forall M_j \in H$ we obtain the set $S_j = \{0, \ldots, K_j\}$, which will be called the set of states of the background load.

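Let us define the mapping $\Psi_j : \mathbb{R}_+ \to S_j$ as:

$\Psi_j(\eta_j) = \begin{cases} 0 & \text{for } \eta_j(0) \le \eta_j < \eta_j(1), \\ 1 & \text{for } \eta_j(1) \le \eta_j < \eta_j(2), \\ \vdots \\ K_j & \text{for } \eta_j(K_j) \le \eta_j < \infty \end{cases}$ (4)

where $\forall j \quad \eta_j = \psi_j(load_j)$.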

The machine $M_j \in H$ is in the state $i \in \{0, \ldots, K_j\}$ of the background load iff $\Psi_j \circ \psi_j(load_j) = i$. Moreover, $\forall M_j \in H$ let us define the bijection $\varphi_1^j : S_j \ni i \mapsto \eta_j(i) \in \Gamma_j$.
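
The mapping from a slowing-down parameter to a state of the background load is a lookup in an ordered list of thresholds, which can be sketched with the standard `bisect` module. The threshold values and the function name `background_state` are hypothetical; the sketch assumes that each state covers a half-open interval, closed on the left.

```python
from bisect import bisect_right

def background_state(eta, thresholds):
    """Return i such that thresholds[i] <= eta < thresholds[i + 1]; the last state is open-ended."""
    if eta < thresholds[0]:
        raise ValueError("slowing-down parameter below the first threshold")
    return bisect_right(thresholds, eta) - 1

# Hypothetical thresholds eta_j(0) < eta_j(1) < eta_j(2) < eta_j(3) for one machine.
gamma_j = [0.0, 0.5, 1.0, 2.0]
states = [background_state(eta, gamma_j) for eta in (0.0, 0.49, 0.5, 1.7, 9.0)]
```

For the five sample values the resulting states are 0, 0, 1, 2 and 3: a value equal to a threshold already belongs to the higher state, and any value beyond the last threshold stays in the last state.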

3  Model  of  Background  Workload 

As the model of background load we assume a tuple $\langle \rho, \{X_n\}_{n=0,1,\ldots}, P, \Gamma, \Theta, S, \varphi_1 \rangle$ [8], where:

• $\rho$ is a tuple $\langle \rho_1, \ldots, \rho_m \rangle$ where $\forall j = 1, \ldots, m$ $\rho_j = (\Omega, \mathcal{F}, P^j)$ is a probability space, where $\Omega$ is a set of events that $M_j$ admits some average value of the execution slowing-down parameter in the period $\Delta_n$ for some $n$, identical for each $M_j$ [7]. In consequence, each event corresponds to a unique state from $S$.

• $\{X_n\}_{n=0,1,\ldots}$ is a tuple $\langle \{X_n^1\}_{n=0,1,\ldots}, \ldots, \{X_n^m\}_{n=0,1,\ldots} \rangle$ of nonstationary discrete stochastic Markov chains [7] which describe the dynamic behavior of the state of background load for each $M_k \in H$ in time. The dynamics is given by:

$P^k(X_{n+1}^k = j \mid X_n^k = i) = P^k(X_{n+1}^k = j \mid X_0^k = i_0, \ldots, X_{n-1}^k = i_{n-1}, X_n^k = i)$ (5)

Each $X_n^k : \Omega \to S_k$ corresponds to the time interval $\Delta_n$, $n = 0, 1, \ldots$.

• $P$ is a tuple $\langle P^1, \ldots, P^m \rangle$ where $P^k = \{P_n^k\}_{n=0,1,\ldots}$ is a sequence of Markov transition matrices. The entry $p_{ij}^k(n) = P^k(X_{n+1}^k = j \mid X_n^k = i)$ is the probability that the process will be in state $j$ of background load at the step $n + 1$ provided that it is in state $i$ at step $n$ for the machine $M_k$. If $\Pi(n) = (\Pi^1(n), \ldots, \Pi^m(n))$ denotes the vector of the state probability distributions at the step $n$ for the network $H$ and $\beta_\mu = (\beta_\mu^1, \ldots, \beta_\mu^m)$ is the initial distribution, then the consecutive distributions may be evaluated according to the formula [6]: $\Pi^k(n) = \Pi^k(n - 1) \cdot P_{n-1}^k$, $n > \mu$.

• $\Gamma$ is a tuple $\langle \Gamma_1, \ldots, \Gamma_m \rangle$ where $\Gamma_j$ is the following set of thresholds for the slowing-down parameters:

$\Gamma_j = \{\eta_j(i) \in \mathbb{R}_+,\ i = 0, \ldots, K_j : \forall i = 0, \ldots, K_j - 1 \quad \eta_j(i) < \eta_j(i + 1)\}$ (6)

• $\Theta$ is a tuple $\langle \Theta_1, \ldots, \Theta_m \rangle$ where $\Theta_j$ is the following set of thresholds for the physical measured performance parameters (like $load_j$) for the machine $M_j$:

$\Theta_j = \{\theta_j(i) \in \mathbb{R}_+,\ i = 0, \ldots, K_j : \forall i = 0, \ldots, K_j - 1 \quad \theta_j(i) < \theta_j(i + 1)\}$ (7)

Moreover, $\forall j = 1, \ldots, m\ \forall i = 0, 1, \ldots, K_j \quad \psi_j : \Theta_j \ni \theta_j(i) \mapsto \eta_j(i) \in \Gamma_j$.

• $S$ is a tuple $\langle S_1, \ldots, S_m \rangle$ where $S_j$ is the set of states of background load for the machine $M_j \in H$ and $card(S_j) = card(\Gamma_j)\ \forall j$.

• $\varphi_1$ is a tuple $\langle \varphi_1^1, \ldots, \varphi_1^m \rangle$ and $\forall j = 1, \ldots, m$ $\varphi_1^j$ is a mapping $\varphi_1^j : S_j \to \Gamma_j$.


Having the stochastic processes $\{X_n^j\}_{n=0,1,\ldots}$ which describe the dynamic behavior of the background load of $M_j \in H$ for $j = 1, \ldots, m$, and the bijections $\varphi_1$ and $\varphi_2 = \langle \varphi_2^1, \ldots, \varphi_2^m \rangle$ where $\varphi_2^j : \Gamma_j \to \mathbb{R}_+$, we may define the new stochastic processes $\{T_j(n)\}_{n=0,1,\ldots}$ for $M_j \in H$, $j = 1, \ldots, m$, describing the dynamic behavior of the times required to execute a single task $t \in V_D$ on the machine $M_j$ with the background load, and $\forall j \in \{1, \ldots, m\}$, $n = 0, 1, \ldots$

$T_j(n) = \varphi_2^j \circ \varphi_1^j(X_n^j) = (1 + \eta_j(X_n^j)) \cdot T_j'$ (8)
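
The forecast that this model provides can be sketched for a single machine: the state distribution is propagated with the transition matrix and then turned into an expected task execution time. All numbers are hypothetical, the chain is taken to be stationary for brevity (the model above allows a different matrix at each step), and the task time in state $i$ is assumed to be $(1 + \eta(i))$ times the unloaded time.

```python
def step(dist, P):
    """One step of Pi(n) = Pi(n - 1) * P for a row vector dist and transition matrix P."""
    k = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(k)) for j in range(k)]

# Hypothetical 3-state chain for one machine; every row of P sums to 1.
P = [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.3, 0.7]]
eta = [0.0, 0.5, 1.5]      # slowing-down parameter attached to each state
t_unloaded = 10.0          # task time without background load, in seconds

dist = [1.0, 0.0, 0.0]     # initial distribution: the machine starts unloaded
forecast = []
for n in range(3):
    dist = step(dist, P)
    expected_time = sum(p * (1.0 + e) * t_unloaded for p, e in zip(dist, eta))
    forecast.append(expected_time)
```

In this example the expected task time grows from step to step because probability mass drifts from the unloaded state towards the loaded ones; a control policy can use such a forecast when deciding how many tasks to send to the machine.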


4  Stochastic  Control  Based  on  Background  Load  Model 

The control policy $u$ is the sequence $\{F_n\}_{n=\mu,\mu+1,\ldots}$ where the subscript $\mu > 0$ denotes the starting time epoch (the time step of the nonstationary stochastic process describing the dynamic behavior of the background load) of the realized policy. The policy $u \in U_\mu$ is an admissible policy if

€  (0. 1],  Vj  =  1, . . . ,  m  Vzz  =  p,  p  -1- 1, . . . 

■  card(P-^(Mj)))  <  )  <  1  (9) 

where $P(\cdot)$ denotes the probability.
Let $C_n(\beta_\mu, s, u)$ denote the cost function for the step $n = \mu, \mu + 1, \ldots$ of the realized policy $u \in U_\mu$ (consequently, that is the cost of the function $F_n$ applied in the period $[\tau_n, \tau_{n+1})$, $n = \mu, \mu + 1, \ldots$), defined for the initial distribution $\beta_\mu = (\beta_\mu^1, \ldots, \beta_\mu^m)$ of the state of background load in the time epoch $\tau_\mu$ and the achieved state $s = (s^1, \ldots, s^m)$, $s^j \in S_j$, in $\tau_n$. It is defined as:

$C_n(\beta_\mu, s, u) = \max_{M_j \in H} \left( c^j(n, s^j, u) \right)$ (10)

where $c^j(n, s^j, u)$ is an immediate cost function for the machine $M_j \in H$ that is in the state $s^j \in S_j$. Also, we assume that $\forall u \in U_\mu$ and $\tau_n > \tau_\mu$, and for any initial distribution $\beta_\mu^j$ and any state in the time epoch $\tau_n$ for $M_j$, there exists an expectation value $E_{\beta_\mu^j}^u(c^j(n, s^j, u))$ [3, 4, 6].
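
The step cost is a maximum over the machines, as the following sketch illustrates. The immediate cost used here, the time a machine needs to run its allotted tasks one after another, is only an assumed example, as are the function names and all numbers.

```python
def immediate_cost(num_tasks, eta, t_unloaded):
    # Hypothetical immediate cost: time for a machine to run its allotted tasks in sequence.
    return num_tasks * (1.0 + eta) * t_unloaded

def step_cost(allocation, etas, t_unloaded):
    """Cost of one step: the slowest machine determines when the step completes."""
    return max(immediate_cost(k, e, t) for k, e, t in zip(allocation, etas, t_unloaded))

etas = [0.0, 1.0, 0.5]             # slowing-down parameter of each machine's current state
t_unloaded = [10.0, 10.0, 20.0]    # unloaded task time per machine, in seconds
even = step_cost([4, 4, 4], etas, t_unloaded)
balanced = step_cost([7, 3, 2], etas, t_unloaded)
```

With these numbers the even allocation costs 120 seconds because the third machine is the bottleneck, whereas the allocation that takes the load states into account costs 70 seconds. Taking the maximum rather than the sum is what makes the policy sensitive to the single most loaded machine.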

Let $C^Z(\beta_\mu, u)$ be the finite horizon cost for the network $H$ related to the policy $u \in U_\mu$ provided in the time period $[\mu, Z]$ ($Z < \infty$). It is defined as:

$C^Z(\beta_\mu, u) = E_{\beta_\mu}^u \left( \sum_{n=\mu}^{\mu+Z-1} C_n(\beta_\mu, s, u) \right)$ (11)

where $\beta_\mu = (\beta_\mu^1, \ldots, \beta_\mu^m)$.

We define the control problem (see [4, 7, 8]) as:

Problem 1. Find a policy $u \in U_\mu$ that minimizes $C^Z(\beta_\mu, u)$ over the whole $U_\mu$.

5  MDP  Based  Stochastic  Control 

The following control problem is based on the theory of Markov Decision Processes (MDP) [3, 4]. Let $T = \{\tau_\mu, \ldots, \tau_{\mu+Z-1}\}$ be a set of decision epochs corresponding to the beginnings of the periods $\Delta_n$, $n = \mu, \ldots$, with $Z \in \mathbb{N}$, $Z < \infty$. We assume that the last decision is made at $\tau_{\mu+Z-1}$.

In each decision epoch the decision agent receives the state $s = (s^1, \ldots, s^m) \in S$ of the network $H$, in which it may choose an action $a = (a^1, \ldots, a^m) \in A_s = A_{s^1}^1 \times \ldots \times A_{s^m}^m$, where $A_{s^k}^k$ is the discrete set of actions which may be chosen if the state of machine $M_k$ is $s^k$. Moreover, we assume the actions are chosen in a deterministic way. Assume that $A = \bigcup_s A_s$ and $A = A^1 \times \ldots \times A^m$ where $A^k = \bigcup_{s^k} A_{s^k}^k$. As a result of choosing the action $a^k \in A_{s^k}^k$ in the state $s^k$ at the decision epoch $\tau_i$, an immediate cost $c^k(i, s^k, u)$ ($u \in U_\mu$) is incurred and the state of machine $M_k$ in the decision epoch $\tau_{i+1}$ is determined by the probability distribution $p_i^k(\cdot \mid s^k, a^k)$, with $\sum_{j \in S_k} p_i^k(j \mid s^k, a^k) = 1$.

A decision rule prescribes a procedure for action selection in each state and decision epoch $\tau_i$. It is defined as a function $d_i : S \ni s \mapsto d_i(s) \in A_s$, where if $s = (s^1, \ldots, s^m)$ then $d_i(s) = (d_i^1(s^1), \ldots, d_i^m(s^m))$. The control policy is defined as the sequence $u = (d_0, d_1, \ldots, d_{Z-1})$. We will use the finite horizon Markov deterministic policies [3, 4, 6] in order to control the execution of the distributed application. This means that each decision rule depends on previous states of the machines' whole workload and the actions selected in these states, and each action is chosen with certainty. The whole workload of a machine will be understood as the joint load of the background and the load derived from the execution of the task belonging to an application $A$.

The  admissible  policy  u  is  defined  by  the  formula  (9)  where 

{dri  (®t!  )}n  =0, 1 Z-l  —  —  +  —  {En}  n  —  u.u  +  l...  (1^) 
The sample space for the network $H$ has the form:

$\Omega = S \times A \times S \times \ldots \times A \times S = \{S \times A\}^Z \times S$ (13)

and the elementary event $\omega \in \Omega$ is a sequence

$\omega = (s_0, a_0, s_1, a_1, \ldots, a_{Z-1}, s_Z)$ (14)

$\forall n \quad s_n = (s_n^1, \ldots, s_n^m), \quad a_n = (a_n^1, \ldots, a_n^m)$

The random variables $X_i(\omega)$, $\omega \in \Omega$, $i = 0, 1, \ldots$ are defined as $X_i : \Omega \to S$ and $X_i(\omega) = s_i$, where $i$ corresponds to the decision epoch $\tau_i$ (and time period $\Delta_i$). The sequence of random variables $X_i^k(\omega)$, $\omega \in \Omega$, $i = 0, 1, \ldots$ is defined as $X_i^k : \Omega \to S_k$ and $X_i^k(\omega) = s_i^k$, where $s_i^k \in S_k$ is the state of the joint load of the background and the load derived from the execution of the task belonging to an application $A$ (the whole workload).

If  we  denote  by  hf  the  history  for  the  machine  Mk  E  H  i.e. 


=  {X;^slY^  =  4,...,Y^^=al„X,  =sf),  ^<r.</r  +  Z-l  (15) 
then  the  dynamic  behavior  of  the  state  of  the  whole  workload  of  machine  Mk  is  given  by: 


Pu  (4  +  1  =  hYn  =  «n)  =4(ib^«n)  (16) 

and  hi  =  {h},. . h^)  and  Y„{uj^)  =  a„  £  Ts„,  Xn(i^^)  =  s„. 

The  state  probabilities  7T\fi  +  i+l)  for  random  variable  at  the  decision  epoch 

Tfj+i+i  can  be  evaluated  by: 

F^/i  +  i4-l)=  ^p*^^,(j|s},aj-).77\^  +  0)  (17) 

j€Sk 

where  77^(p  +  i)  is  the  probability  distribution  at  the  decision  epoch 

We  have  no  probabilities  (j  I  .  S' )  computation  of  probability  distribution 

n^{li  +  i)  will  be  based  on  the  probability  distribution  77*^(p  +  z)  of  the  state  of  background 

load.  Let  sign  r,;  =  p(i  +  1)  -  p(i)  and  6  is  the  load  derived  from  the  task  of  application 

A  and  q-,  =  — .  Let  us  assume  that  the  load  8  is  less  then  then  the  evaluation  of  the 
‘ 

probability  distribution  77  (p  +  i)  is  as  follows  (in  matrix  form)  ([8]): 


(77"(n))^  =  B  ■  {n^{n)f  =  B  ■  (P^n)  •  77''(n  -  1))^ 


(18) 


The  superscript  T  denotes  the  matrix  transposition  and  matrix  B  has  the  following  form: 


l-oi  0  0  0 

Ql  1  —  0  0 

0  02  1  —  03  0 


(19) 


0  ...  QA-_1  1 


The  finite  horizon  cost  for  network  H  for  the  deterministic  Markov  policy  u  E  f+,  and 
for  the  finite  horizon  Z  is  defined  as: 
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C\l3,.u) 


(20) 


where  =  y„,h„  =  ,  a„_i,  s),  and  =  a*  |s^)  =  1  and/?,,  =  . /?;"). 

We  define  the  control  problem  CP2  ([8,9])  as: 

Problem^.  Find  a  policy  u  e  that  minimizes  C^{P^,u)  over  the  whole  U^. 

The existence of the optimal Markov deterministic policy is guaranteed by the following
theorem (see [3,8]):


Theorem. Assume S is finite or countable. Then if

1. for each $s \in S$ the set of actions $A_s$ is finite, or

2. $A_s$ is compact, the cost function $C_n(\beta_\mu, s, a)$ for the step n of the realized policy is
continuous in a for each $s \in S$, there exists $C < \infty$ for which $C_n(\beta_\mu, s, a) \le C$ for all $a \in A_s$, $s \in S$,
and $p^k_n(j \mid s^k, a^k)$ is continuous in $a^k$ for all $j \in S_k$ and $s^k \in S^k$ and n = Z - 1

then there exists a deterministic Markovian policy which is optimal.


6  The  Policies  of  Tasks  Distributions 

We will consider policies which lead to the fastest execution of the application A.
They generally fall into two groups: the group of deterministic policies and the group of
stochastic policies based on the described Markov models.

The following policies from the first group are described in detail in [7, 8]: single task,
multiple task, dynamic single distribution, and stationary.

In order to execute the group of stochastic policies, we introduce the following classes
of agents: state agents, which monitor the load over time and prepare the state forecast for
each machine in H; and decision making agents, which establish the task allocation
policy based on the actual and forecasted average state of each machine in H (or the forecasted
computation time of a task) [7, 8]. Two algorithms belonging to the
group of stochastic policies are presented below: the first based on the background load model (open loop control)
(§3) and the second based on the MDP model (closed loop control) (§4) [8].

6.1  Open  Loop  Control 

The following algorithm is based on the forecast of the average state of the background load,
generated only once, in the starting time epoch $\tau_\mu$, for each machine in H. More precisely, this
policy relies on the computation of the distributions $\Pi^k(n)$ from the initial distributions
of the state of the background load for each machine $M_k \in H$, recursively according to the
formula:

$$\Pi^k_i(n) = \sum_{j \in S} p^k_{ji}(n-1) \cdot \Pi^k_j(n-1), \qquad \tau_n > \tau_\mu \tag{21}$$
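The recursion (21) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a time-homogeneous transition matrix (the paper identifies a chain that is periodic over 24 hours, so in practice the matrix would change with the epoch), and the function name `forecast` and the numbers in `P` are hypothetical.

```python
def forecast(initial, transition, steps):
    """Propagate a load-state distribution through a Markov chain.

    initial[j]       -- probability of load state j at the starting epoch
    transition[j][i] -- probability of moving from state j to state i
                        (assumed constant over time in this sketch)
    Returns the list of distributions for epochs 0 .. steps.
    """
    dist = list(initial)
    history = [dist]
    for _ in range(steps):
        # One application of (21): new_i = sum_j p(j -> i) * old_j
        dist = [
            sum(transition[j][i] * dist[j] for j in range(len(dist)))
            for i in range(len(dist))
        ]
        history.append(dist)
    return history


# Hypothetical three-state chain (states 0, 1, 2); values are illustrative only.
P = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.3, 0.6],
]
forecasts = forecast([1.0, 0.0, 0.0], P, steps=3)
```

Because the forecast is computed once at the starting epoch, the whole list `forecasts` is available before any task is dispatched, which is what makes the policy open loop.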

Now,  having  the  vector  of  the  state  probabilities  for  each  time  epoch  we  may  compute 
the  expected  execution  times  of  one  task  on  each  machine  in  each  time  step  which  leads  to 
the  computation  of  power  coefficients  Xj  for  all  machines  AIj  as  well  as  power  coefficients 
A  for  the  network  H  (defined  as  ([7,8]):  Xj  =  X-  j 
The consecutive decision rule (or function $F_n$) allocates subsets of tasks
whose cardinal numbers are proportional to the machines' expected power coefficients
for $n = \mu, \mu+1, \ldots, \mu+Z-1$.
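A proportional allocation of this kind can be sketched as follows. The sketch only illustrates the idea of rounding down shares that are proportional to power coefficients; the function name `allocate`, the sample coefficients, and in particular the rule for distributing the tasks lost to rounding are assumptions, not the rule used in Algorithm 1.

```python
import math


def allocate(num_tasks, power):
    """Split num_tasks among machines in proportion to their power coefficients."""
    total = sum(power.values())
    # Each machine receives the floor of its proportional share.
    shares = {m: math.floor(num_tasks * p / total) for m, p in power.items()}
    # Assumption: tasks lost to rounding go to the most powerful machines first.
    leftover = num_tasks - sum(shares.values())
    for m in sorted(power, key=power.get, reverse=True)[:leftover]:
        shares[m] += 1
    return shares


# Hypothetical power coefficients for three machines.
power = {"M1": 1.0, "M2": 2.0, "M3": 5.0}
shares = allocate(100, power)
```

Since every floor loses less than one task, the leftover is always smaller than the number of machines, so one extra task per machine is enough to place all of them.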

Algorithm  1 

1.  n  =  p\ 

V[n)  =  Vd\ 


2.  FOR(  each  Mj  €  H  ) 


Let  ^j(n)  = 


M,eH 


[/In  '  (n.)J  , 


3.  IF(  C  <  card(V)  ) 

{  FOR(  each  Mj  6  H  ) 

Let  FF^{Mj)  C  V(n]  such  that  card{F~^{Mj))  =  \_A„  ■  Aj(n)J 

K(n  +  1)  =  V(n.)\U,F„-'(M;); 

n  —  n  +  I', 

GOTO  2; 


ELSE  /*  last  step  */ 

{  ^  —  12mj 

FOR(  each  Mj  €  H  ) 


FOR(  each  Mj  eH  ) 

Let  Wj  =  F,;HMj)  such  that  the  set  l} 

is  minimal  over  {WOlM^eHand  (J.  vv^^VCn)'- 

} 

4.  /***  start  execution  for  the  horizon  Z  ***/ 


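The expected time of a single task execution on the machine $M_k$ in each subsequent time step
$n = \mu, \mu+1, \ldots$ is as follows ([8]):

$$E_{\beta_\mu}\left(\tau^k(n)\right) = \begin{cases} (1 + \eta_k(j)) \cdot T_1^k & \text{for } n = \mu \\ \sum_{j=0}^{K} \left(\Pi^k_j(n) \cdot (1 + \eta_k(j)) \cdot T_1^k\right) & \text{for } n > \mu \end{cases} \tag{22}$$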

but  the  expected  cost  for  one  step  Cn(/3fj,Sn,u)  of  the  realized  policy  u  =  {T„ + 
is  as  follows  ([8]): 

C„(3^.s„,u)  =  (23) 

J  maxF„(Mt)  {r’^'(n)  •  (1  +^/ht(j))  •  T/  }  for  n  =  p 

\  (^jin)  ■  (1  +  mU))  '  Tt)'j  for  n  >  p 
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where  -  (A^ , . . . ,  v'^{n)  =  card{F„  and  since  u  is  the  admissible  policy, 

'dk  =  1, . . . ,  m  functions  satisfy  the  constraint  (9). 


6.2  Closed  Loop  Control 

In the closed loop algorithm the choice of the first decision in the starting time epoch is
analogous to the open loop algorithm. The choice of the consecutive decisions (concerning
the allocation of task subsets) is based on the forecasted average states of the whole
workload for each machine in H ($\Pi^k(n)$ distributions) generated in each time epoch $\tau_n$,
$n = \mu+1, \ldots, \mu+Z-1$. In other words, the choice of action is based on the forecasted
computation time of an elementary task generated in each time epoch $\tau_n$, $n = \mu+1, \ldots, \mu+Z-1$.

Algorithm  2 


1.  V{n)  =  VD; 

FOR(  each  Mj  £  H 

Xj{n) 


Let 


1 

■— Ar, 


FOR(  each  Mj  £  H  )  /*  computation  is  based  on  [p)  *  j 

Let  T-i(Mj)  C  V{fi)  such  that  card{F-^{Mj))  =  [Af,  ■  Aj(//)J  ; 


/***  start  execution  in  the  ***/ 


n  =  p  -h  1; 


2. 


FOR(  each  Mj  &  H  )  /*  computation  is  based  on  {n)  */ 


[^n  •  AjJ 


3.  IF(  C  <  card{V)  ) 

{  FOR(  each  Mj  £  H  ) 

Let  FfHMj)  C  V'(r7.)  such  that  cardiFF^Mj))  =  [A,,  ■  Aj(7r)J  : 

/***  start  execution  in  the  A„,  ***/ 

?!  =  n  +  1 ; 

GOTO  2; 


ELSE  /*  last  step  */ 

{  =  Em,  EC"); 

FOR(  each  Mj  £  H  ] 


FOR(  each  Mj  £  H  ) 

Let  Wj  =  F^^iMj)  such  that  the  set 


{I 


card(F,~‘l-V,)} 

cardl  \'(n)) 
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is  minimal  over  and  [j^W,  =  Viny 

} 


The expected time of a single task execution on the machine $M_k$ in each subsequent time step
$n = \mu, \mu+1, \ldots$ is as follows ([8]):


(24) 

j=o 

but  the  expected  cost  for  one  step  C„(/d^,  s„,  u)  of  the  realized  policy  u  =  {an}ii=^+i,/i+2. ...  = 
{Fnh  is  as  follows  ([8]): 


K 

I 

j=Q 


Cn{6„Sn,an)=  max  (l  +  7?,(i))'T/-)  i  (25) 


where  s„  =  (s)j , . . . , s™),  a„  =  {aj^, . . .  and  because  u  is  the  admissible  policy  so. 

Vn  'Vk  a-n  it  has  to  satisfy  constraint  (9). 

But  the  cost  of  the  first  time  step  i.e.  for  n  =  p  is  as  follows: 

K 


Ep.  (TUn))=E^.  (<p^o^J(A'^)))  =£(/7j'(n).  (1  +  %(J))-T)'') 


(26) 


7  Numerical  Tests 

The numerical tests concern the problem of determining the piezometric height distribution
during the filtration of water through a cohesive porous medium at a small filtration
velocity. The topological decomposition (SBS computation) method was used to solve it.
The main stages of this complex method for stationary problems are presented in
Fig. 1 ([8]). For nonstationary problems the presented stages are repeated in each time
step (nonlinear solver).


No. of tasks | single distribution | multiple task policy | work-greedy policy
100          | 1.5163              | 1.9428               | 2.6339
200          | 1.5694              | 1.9515               | 2.6105

Table 1. The speedup values for the deterministic policies realized under medium
additional load


The two most time-consuming parts of the SBS computation are the matrix formulation and the
nonlinear algebraic solver. Therefore, the models and algorithms presented in this paper
were applied to the distributed computation of these two stages. A detailed description of the
CFG method, the engineering problem under consideration, and the solution
method is given in [5, 7, 8].

Figure 1: Computational technology for the CFG problems

The network H used consists of a SPARC 4 (as master), two SPARC ELC, a SUN 490 and a
CONVEX 3200 as vector node. During the identification of the Markov chain we
assume that it is periodic with a period of 24 hours and time intervals $\Delta_i = 1$ hour (see
[7]). We obtained speedup values for the deterministic policies applied under medium background
load for the above numerical application consisting of 100 and 200 tasks (see Table 1, and
see [7] for more details of this experiment). These results show the varying effectiveness of
the different deterministic policies for a fine grain application.


machine     | $\theta(0)$ | $\theta(1)$ | $\theta(2)$
SUN 490     | 0           | 1.1         | 1.9
SPARC ELC   | 0           | 0.8         | 1.2
CONVEX 3200 | 0           | 8.0         | 9.8

Table 2. The sets of thresholds for the machines SUN 490, SUN SPARC ELC and CONVEX 3200


In the case of the Markov background load model we assume a uniform space of load states
S = {0, 1, 2} for each machine in H (i.e. K = 2 and $\forall M_j \in H$: $S_j = S$). The state s = 0
is the state without background load and s = 2 is the state with the greatest background
load. The average values of the sets of thresholds $\Theta_j$ for the machines SUN 490, SPARC ELC
and CONVEX 3200 are presented in Table 2. They were determined on the basis of a
great number of tests performed for the same pattern application and defined
so that the sets of thresholds for the slowing-down parameters were identical for each machine
in H. So, $\forall M_j \in H$:

$$\eta_j(s) = \begin{cases} 0 & \text{if } s = 0 \\ 0.4 & \text{if } s = 1 \\ 0.8 & \text{if } s = 2 \end{cases} \tag{27}$$
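To make the role of the slowing-down parameters concrete, the following sketch computes the expected execution time of one task from a forecast load distribution, in the spirit of the sum over states used for the expected times above. The distribution `[0.5, 0.3, 0.2]` and the base time of 1.0 sec are hypothetical, and the value 0 for the unloaded state is taken from the statement that s = 0 means no background load.

```python
# Slowing-down parameters per load state, following (27).
ETA = {0: 0.0, 1: 0.4, 2: 0.8}


def expected_task_time(dist, base_time):
    """Expected execution time of one task.

    dist[s]   -- forecast probability of load state s
    base_time -- execution time of the task without background load
    """
    return sum(dist[s] * (1.0 + ETA[s]) * base_time for s in ETA)


# Hypothetical forecast over the states 0, 1, 2 and a base time of 1.0 sec.
t = expected_task_time([0.5, 0.3, 0.2], base_time=1.0)
```

With these inputs the three states contribute 0.5, 0.42 and 0.36 sec, so `t` is about 1.28 sec; a machine that is certainly idle (distribution `[1, 0, 0]`) simply returns the base time.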
Figure 2: The probability distributions for three initial distributions


For example, if the value of the background load ($load_j$) measured by the state agent
for the machine $M_j$ belongs to the interval $[\theta_j(1), \theta_j(2))$ or, equivalently ($\varphi_j$ is a bijection),
$\varphi_j(load_j) \in [\eta_j(1), \eta_j(2))$, then the state of the background load is equal to $\Psi_j(\varphi_j(load_j)) = 1$.
Then, given the time of a single task execution without background load $T_1$ = 2.1 sec, the
execution time with the additional background load is equal to 1.4 · 2.1 = 2. .58 sec.


work-greedy policy | stationary policy | closed-loop policy
11931.5 [sec]      | 12646.2 [sec]     | 12217.8 [sec]

Table 3. The execution times for the fine grain application consisting of 10000 identical tasks.


Fig. 2 shows the probability distributions for three possible initial distributions
for the LAN server SUN 490. We can see that only the distributions for the first hours
depend strongly on the initial distribution, while later distributions do not depend on it.
This means that the process is ergodic. Some other results involving Markov chain estimation
were presented in [7, 8].

Several tests were performed in order to show the influence of the process granularity on
the control results (see [7,8]). The comparison of the deterministic and the proposed closed loop
stochastic control algorithms for the computation of matrix coefficients (matrix formulation
stage, Fig. 1), which is the fine grain part of the CFG computations, is presented in Table 3.
The same algorithms were used to control the coarse grain computation. This
granularity of the matrix coefficient computations was obtained by dividing
the whole set of elementary tasks into several subsets with different cardinal
numbers. Table 4 presents the cardinal numbers of the subsets (the new coarse grain tasks),
the number of them, and the results of the computations.


Number of elementary tasks in the subset | 80 | 40 | 14  | 2  | 1
No. of subsets, i.e. coarse grain tasks  | 60 | 80 | 120 | 80 | 160

work-greedy policy | stationary policy | closed loop policy
13616.3 [sec]      | 15362.9 [sec]     | 12760.3 [sec]

Table 4. The execution times for the coarse grain application


The next group of tests was performed in order to compare the stationary policy ([7, 8])
and the proposed open loop control algorithm applied to the second stage of the CFG
technology, i.e. the iterative linear algebraic solver (Fig. 1). In this case there were 16 similar
tasks (one task needs about 3.5 MB of memory), each associated with a similar
rectangular subdomain (the generated mesh consisted of 100x160 nodes and about 32000
triangular elements), and the horizon was Z = 1 (control for one time step).


static control, 16 coarse grain tasks
times of 5 tests | 4304 sec | 2764 sec | 4638 sec | 3693 sec | 3952 sec

Table 5. The execution times of the SBS-PCG application consisting of 16 coarse grain tasks for
static control


In the case of static control, which is based on the state of each machine of the network
in the starting time epoch, we obtained the results presented in Table ??. The next
tests, involving open loop control, were performed for the case of the Markov forecast. The
results for the background load increase and decrease forecasts are presented in Table 6.

8  Conclusions 

1. Stochastic policies are better suited than deterministic ones to the coarse grain problem
of task distribution, e.g. the PSG-SBS solution of large-scale linear systems.

2. Closed loop control is usually better than open loop control for computations
on a horizon consisting of more than one time period, because it provides a
better estimate of the computational power of each M ∈ H for each time period during the
computation. In other words, the probability that closed loop control of the distributed
computations gives better execution times than open loop control is greater than zero.

3. The presented models of network workload and stochastic policies of task distribution
can be used not only for managing CAE distributed programs but also for other
large scalable distributed applications.

the execution times for Markov control, 16 coarse grain tasks
the background load increase forecast
2438 sec
2588 sec
the background load decrease forecast

Table 6. The execution times for the SBS-PCG application consisting of 16 coarse grain tasks
for Markov control
Abstract. We consider direct methods for the numerical solution of linear
systems with unsymmetric sparse matrices. Different strategies for
the determination of the pivots are studied. For solving several linear systems
with the same pattern structure we generate a pseudo code that
can be interpreted repeatedly to compute the solutions of these systems.
The pseudo code can be advantageously adapted to vector and parallel
computers. For that we have to find the instructions of the pseudo
code which are independent of each other. Based on this information,
one can determine vector instructions for the pseudo code operations
(vectorization) or spread the operations among different processors
(parallelization). The methods are successfully used on vector and parallel
computers for the circuit simulation of VLSI circuits as well as for the
dynamic process simulation of complex chemical production plants.


1  Introduction 

For solving systems of linear equations

$$Ax = b, \quad A \in \mathbb{R}^{n \times n}, \quad x, b \in \mathbb{R}^n \tag{1}$$

with nonsingular, unsymmetric and sparse matrices A, we use the Gaussian
elimination method. Only the nonzero elements of the matrices are stored for
computation. In general, we need to establish a suitable control for the numerical
stability and for the fill-in of the Gaussian elimination method.

For the time domain simulation in many industrial applications, structural
properties are used for modular modeling. Thus electronic circuits usually
consist of identical subcircuits such as inverter chains or adders. Analogously, complex
chemical plants consist of process units such as pumps, reboilers or trays of distillation
columns. A mathematical model is assigned to each subcircuit or unit, and the models
are coupled. This approach leads to initial value problems for large systems
of differential-algebraic equations. For solving such problems we use backward
differentiation formulas, and the resulting systems of nonlinear equations are
solved with Newton methods. The Jacobian matrices are sparse and maintain their
sparsity structure during the integration over many time steps. In general, the
Gaussian elimination method can be used with the same ordering of the pivots
for these steps. A pseudo code is generated to perform the factorizations of the
matrices and the solution of the systems with triangular matrices efficiently. This
code contains only the operations required for the factorization and for solving
the triangular systems. It is defined independently of a computer and can be
adapted to vector and parallel computers.

* This work was supported by the Federal Ministry of Education, Science, Research
and Technology, Bonn, Germany under grants GA7FVB-3.0M370 and GR7FV1.

The solver has proven successful for the dynamic process simulation of
large real-life chemical production plants as well as for electric circuit simulation.
Computing times for complete dynamic simulation runs of industrial
applications are given. For different linear systems with matrices arising from
scientific and technical problems, the computing times of several linear solvers
are compared.


2  The  method 

The Gaussian elimination method

$$PAQ = LU, \tag{2}$$

$$Ly = Pb, \quad UQ^{-1}x = y \tag{3}$$

is used for solving the linear systems (1). The nonzero elements of the matrix A
are stored in compressed sparse row format, also known as sparse row-wise
format. L is a lower triangular and U an upper triangular matrix. The row
permutation matrix P is used to provide numerical stability and the column
permutation matrix Q is used to control sparsity. In the following, we consider
two cases for the determination of the matrices P and Q.
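The way the two permutations enter the solve in (2) and (3) can be illustrated with a small dense example. This is a sketch under simplifying assumptions: the matrix is dense and tiny, the permutations are given as index lists `row_perm` and `col_perm` (hypothetical names) rather than computed, and the factorization of PAQ itself uses no further pivoting.

```python
def lu_no_pivot(m):
    """In-place style LU of a dense matrix: L below the diagonal (unit diagonal), U on and above."""
    n = len(m)
    lu = [row[:] for row in m]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def solve_permuted(a, b, row_perm, col_perm):
    """Solve A x = b via PAQ = LU, L y = P b, U z = y, x = Q z."""
    n = len(a)
    # (PAQ)[i][j] = A[row_perm[i]][col_perm[j]]
    paq = [[a[row_perm[i]][col_perm[j]] for j in range(n)] for i in range(n)]
    lu = lu_no_pivot(paq)
    # Forward substitution with the permuted right-hand side P b.
    y = [0.0] * n
    for i in range(n):
        y[i] = b[row_perm[i]] - sum(lu[i][k] * y[k] for k in range(i))
    # Back substitution for z = Q^{-1} x.
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(lu[i][k] * z[k] for k in range(i + 1, n))) / lu[i][i]
    # Undo the column permutation: x = Q z.
    x = [0.0] * n
    for j in range(n):
        x[col_perm[j]] = z[j]
    return x


A = [[0.0, 2.0, 1.0],
     [4.0, 0.0, 0.0],
     [1.0, 3.0, 5.0]]
b = [5.0, 4.0, 12.0]           # chosen so that the exact solution is [1, 2, 1]
x = solve_permuted(A, b, row_perm=[1, 0, 2], col_perm=[0, 2, 1])
```

The row permutation here moves the zero away from the first pivot position, which is the stability role of P; the column permutation only reorders the unknowns and is undone in the last loop.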

In the first case, we determine in each elimination step a permutation in
the matrix Q. For this, we search for the first column with a minimal number of
nonzero elements in the matrix to be eliminated. This column becomes the pivot
column [6] and the columns are reordered (dynamic ordering). To keep the
method numerically stable at stage k of the elimination, the pivot $a_{i,j}$ is selected
among those candidates satisfying the numerical threshold criterion

$$|a_{i,j}| \ge \beta \max_{l} |a_{l,j}|$$

with a given threshold parameter $\beta \in (0, 1]$. This process is called partial pivoting.
In our applications we usually choose $\beta = 0.01$ or $\beta = 0.001$.
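The threshold criterion is easy to state in code. The sketch below only returns the set of admissible pivot rows for one column; how the final pivot is picked among them is not specified here, and the column values and the name `threshold_candidates` are hypothetical.

```python
def threshold_candidates(column, beta):
    """Return the row indices i with |a_i| >= beta * max_l |a_l|.

    column -- nonzero entries of the pivot column as {row index: value}
    beta   -- threshold parameter in (0, 1]
    """
    largest = max(abs(v) for v in column.values())
    return sorted(i for i, v in column.items() if abs(v) >= beta * largest)


# Hypothetical pivot column with three nonzero entries.
col = {2: 0.5, 5: -200.0, 7: 3.0}
loose = threshold_candidates(col, beta=0.001)   # every entry is admissible
usual = threshold_candidates(col, beta=0.01)    # the entry 0.5 is rejected
strict = threshold_candidates(col, beta=1.0)    # only the largest entry remains
```

A small value of beta leaves many candidates, which gives more freedom to preserve sparsity, while beta = 1 reduces to classical partial pivoting on the largest entry.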

In the second case, we determine in a first step the permutation matrix Q
by minimum degree ordering of $A^T A$ or of $A^T + A$, using the algorithm from
SuperLU [9]. Then the columns are reordered and in a separate step the
permutation matrix P is determined by using partial pivoting.
3  Pseudo  code

As  mentioned  above,  it  is  possible  to  use  the  Gaussian  elimination  method  with 
the  same  pivot  ordering  to  solve  several  linear  systems  with  the  same  pattern 
structure  of  the  coefficient  matrix.  To  do  this,  we  generate  a  pseudo  code  to 
perform  the  factorization  of  the  matrix  as  well  as  to  solve  the  triangular  systems 
(forward  and  back  substitution). 

For  the  generation  of  the  pseudo  code,  the  factorization  of  the  Gaussian 
elimination  method  is  used  as  shown  in  Fig.  1. 


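for i = 2, n do
    a_{i-1,i-1} = 1 / a_{i-1,i-1}
    for j = 1, i-1 do
        a_{i,j} = (a_{i,j} - \sum_{k=1}^{j-1} a_{i,k} a_{k,j}) a_{j,j}
    enddo
    for j = i, n do
        a_{i,j} = a_{i,j} - \sum_{k=1}^{i-1} a_{i,k} a_{k,j}
    enddo
enddo
a_{n,n} = 1 / a_{n,n}


Fig. 1. Gaussian elimination method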

The  algorithm  needs  n  divisions.  Six  different  types  of  pseudo  code  instruc¬ 
tions  are  sufficient  for  the  factorization  of  the  matrix,  four  instructions  for  the 
computation  of  the  elements  of  the  upper  triangular  matrix  and  two  of  the  lower 
triangular  matrix.  For  computing  the  elements  of  the  upper  triangular  matrix 
one  has  to  distinguish  between  the  cases  that  the  element  is  a  pivot  or  not  and 
that  it  exists  or  that  it  is  generated  by  fill-in.  For  the  determination  of  the  ele¬ 
ments  of  the  lower  triangular  matrix  one  has  only  to  distinguish  that  the  element 
exists  or  that  it  is  generated  by  fill-in. 

Let $l$, with $1 \le l \le 6$, denote the type of the pseudo code instruction, $n$
the number of elements of the scalar product and $k$, $m$, $i_\kappa$, $j_\kappa$, $\kappa = 1, 2, \ldots, n$ the
indices of matrix elements. Then, the instruction of the pseudo code to compute
an element of the lower triangular matrix

$$a(k) = \left(a(k) - \sum_{\kappa=1}^{n} a(i_\kappa)\, a(j_\kappa)\right) a(m)$$

is coded in the following form


| l  n | k | m | i_1 | j_1 | ... | i_n | j_n |
The integer numbers $l$, $n$, $i_\kappa$, $j_\kappa$, $k$ and $m$ are stored in integer array elements.
For $l$ and $n$ only one array element is used.

The  structure  of  the  other  pseudo  code  instructions  is  analogous. 
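How such a code can be interpreted repeatedly for new matrix values may be sketched as follows. The flat layout `[n, k, m, i_1, j_1, ..., i_n, j_n]` used here is a simplification assumed for illustration: it handles only the lower-triangular instruction type and therefore omits the type field, and it does not pack two numbers into one array element.

```python
def run_lower_instructions(code, a):
    """Interpret a flat integer array of lower-triangular instructions.

    Each instruction is laid out as [n, k, m, i_1, j_1, ..., i_n, j_n]
    (assumed layout) and performs
        a[k] = (a[k] - sum(a[i] * a[j])) * a[m]
    on the array a of stored matrix values, in place.
    """
    pos = 0
    while pos < len(code):
        n, k, m = code[pos], code[pos + 1], code[pos + 2]
        pos += 3
        s = 0.0
        for _ in range(n):
            s += a[code[pos]] * a[code[pos + 1]]
            pos += 2
        a[k] = (a[k] - s) * a[m]
    return a


# Two hypothetical instructions acting on four stored values (0-based indices).
values = [2.0, 3.0, 10.0, 0.5]
code = [1, 2, 3, 0, 1,      # a[2] = (a[2] - a[0]*a[1]) * a[3]
        0, 1, 3]            # a[1] = a[1] * a[3]
run_lower_instructions(code, values)
```

The code array depends only on the sparsity pattern and pivot order, so the same `code` can be run again after `values` has been refilled with the entries of the next matrix.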

Let $\mu$ denote the number of multiplications and divisions for the factorization
of the matrix and $\nu$ the number of nonzero elements of the upper and lower
triangular matrices. Then one can estimate the number of integer array elements
that are necessary to store the pseudo code with

$$\gamma(\mu + \nu).$$

Here $\gamma \approx 2.2$ was found to be sufficient for large systems with more than a
thousand equations, while one has to choose $\gamma \approx 4$ for smaller systems.

4  Vectorization  and  parallelization 

The  pseudo  code  instructions  are  used  for  the  vectorization  and  the  paralleliza¬ 
tion  as  well.  For  the  factorization  in  (2)  and  for  solving  the  triangular  systems 
in  (3),  elements  have  to  be  found  that  can  be  computed  independently  of  each 
other. 

In the case of the factorization, a matrix

$$M = (m_{ij}), \quad m_{ij} \in \mathbb{N} \cup \{0, 1, 2, \ldots, n^2\}$$

is assigned to the matrix

$$LU = PAQ,$$

where $m_{ij}$ denotes the level of independence.

In the case of solving the triangular systems, vectors

$$p = (p_i) \text{ and } q = (q_i), \quad p_i, q_i \in \{0, 1, \ldots, n\}$$

are assigned analogously to the vectors x and y from

$$Ly = Pb \text{ and } UQ^{-1}x = y.$$

Here the levels of independence are denoted by $p_i$ and $q_i$.

The  elements  with  the  assigned  level  zero  do  not  need  any  operations.  Now, 
all  elements  with  the  same  level  in  the  factorized  matrix  (2)  as  well  as  in  the 
vectors  x  and  y  from  (3)  can  be  computed  independently.  First  all  elements  with 
level  one  are  computed,  then  all  elements  with  level  two  and  so  on. 

The levels of independence for the matrix elements in (2) and for the vector
elements in (3) can be computed with the algorithm of Yamamoto and Takahashi [11].
The algorithm for the determination of the levels of independence
$m_{ij}$ is shown in Fig. 2. The corresponding algorithm for the determination of
the elements of the vectors p and q is analogous to it.
M = 0
for i = 1, n-1 do
    for all {j : a_{j,i} ≠ 0 & j > i} do
        m_{j,i} = 1 + max(m_{j,i}, m_{i,i})
        for all {k : a_{i,k} ≠ 0 & k > i} do
            m_{j,k} = 1 + max(m_{j,k}, m_{j,i}, m_{i,k})
        enddo
    enddo
enddo

Fig. 2. Algorithm of Yamamoto and Takahashi
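The level computation of Fig. 2 can be tried out on small sparsity patterns with the following sketch. It works on a dense boolean pattern for readability and uses 0-based indices; treating fill-in positions as nonzero as soon as they are created is an assumption of this sketch, and `independence_levels` is a hypothetical name.

```python
def independence_levels(pattern):
    """Levels of independence for the elements of the factorized matrix.

    pattern[i][j] is True where a_ij is structurally nonzero (0-based).
    Returns the integer matrix M of levels; level 0 means no operation.
    """
    n = len(pattern)
    nz = [row[:] for row in pattern]     # working copy that also records fill-in
    m = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            if not nz[j][i]:
                continue
            # Element of L: depends on its own updates and on the pivot.
            m[j][i] = 1 + max(m[j][i], m[i][i])
            for k in range(i + 1, n):
                if nz[i][k]:
                    # Updated element: depends on itself and both factors.
                    m[j][k] = 1 + max(m[j][k], m[j][i], m[i][k])
                    nz[j][k] = True      # assumption: fill-in counts as nonzero from now on
    return m


full_2x2 = independence_levels([[True, True], [True, True]])
arrow = independence_levels([[True, False, False],
                             [True, True, False],
                             [True, False, True]])
```

For the full 2x2 pattern the multiplier gets level 1 and the updated diagonal element level 2; in the arrow pattern both multipliers get level 1 and can therefore be computed independently of each other.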


For a vector computer, we have to find vector instructions at the different
levels of independence [2,7]. Let $a(i)$ denote the nonzero elements in LU. The
vector instructions shown in Fig. 3 have proven successful in the
case of factorization. The difficulty is that the array elements are addressed
indirectly. However, adequate vector instructions exist for many vector computers.
The Cray vector computers, for example, have explicit calls to gather/scatter
routines for the indirect addressing.


s = \sum_{\kappa} a(i_\kappa) * a(j_\kappa)
a(i_k) = 1 / a(i_k)
a(i_k) = a(i_k) * a(i_l)
a(i_k) = (a(i_l) * a(i_m) + a(i_p) * a(i_q)) * a(i_r)

Fig. 3. Types of vector instructions for factorization


For parallelization, one needs to distinguish between parallel computers with
shared memory and with distributed memory.

In the case of parallel computers with shared memory and p processors, we
assign the pseudo code for each level of independence to the processors in parts of
approximately the same size. After the processors have executed their part of the
pseudo code instructions of a level concurrently, a synchronization among the
processors is needed. Then the execution of the next level can be started. If the
processors are vector processors, this property is also used. The moderately
parallel computer Cray J90 with a maximum number of 32 processors is an
example of such a computer.

In the case of parallel computers with distributed memory and q processors,
the pseudo code for each level of independence is again partitioned into q parts
of approximately the same size. But in this case, the parts of the pseudo code are
moved to the memory of each individual processor. The transfer of parts of the
code to the memories of the individual processors is done only once. A synchronization
is carried out analogously to the shared memory case. The partitioning
and the storage of the matrix as well as of the vectors is implemented in the following
way. For small problems the elements of the matrix, right hand side and
solution vector are located in the memory of one processor, while for large problems
they have to be distributed over the memories of several processors. We
assume that the data communication between the processors for the exchange
of data concerning elements of the matrix, right hand side and solution vector is
supported by the operating system. The massively parallel computers Cray T3D
and T3E are examples of such computers.

Now,  we  consider  a  small  example  to  illustrate  our  approach.  For  a  matrix 


i; 


the determination of the permutation matrices P and Q gives


(4) 


f! 


PAQ  = 


4 

7 

2 


V 


9 

9  1 

1  7  8 
13  5/ 


(5) 


The  nonzero  elements  of  the  matrix  A  are  stored  in  sparse  row  format  in  the 
vector  a.  Let  DU  denote  the  index  of  the  i-th  element  in  the  vector  a,  then  the 
elements  of  the  matrix  P AQ  are  stored  in  the  following  way 


/m  [8]  \ 

m]  SI  M 

[am  m  . 

msiEi] 

\  a  m  m  / 

The  matrix  M  assigned  to  the  matrix  PAQ  is  found  to  be 


M  = 


/  0  0 
1  2 
3 

V 


0 

0  4 

1  0  5 

1  1  6/ 


(7) 


From  (7),  we  can  see,  that  six  independent  levels  exist  for  the  factorization. 

The  instructions  for  the  factorization  of  the  matrix  A  resulting  from  (4)  -  (7) 
are  shown  in  Table  1. 
Table 1. Instructions for the factorization

Level | Instructions
1     | a(12) = a(12)/a(7)
      | a(9)  = a(9)/a(1)
      | a(4)  = a(4)/a(1)
      | a(5)  = a(5)/a(10)
2     | a(13) = a(13) - a(12) * a(8)
3     | a(2)  = a(2)/a(13)
4     | a(3)  = a(3) - a(2) * a(14)
5     | a(11) = a(11) - a(5) * a(3)
6     | a(6)  = a(6) - a(4) * a(3) - a(5) * a(11)

Now, we consider, for example, only the instructions of level one in Table 1.
One vector instruction of length four can be generated (see Fig. 3) on a vector
computer.

On a parallel computer with distributed memory and two processors, the
allocation of the instructions of level one to the processors is shown in Table 2.
The transfer of the instructions to the local memory of the processors is done
during the analysis step of the algorithm. The data transfer is carried out by the
operating system.


Table 2. Allocation of instructions to processors

                    processor one     processor two
computation of      a(12), a(9)       a(4), a(5)
                    synchronization


On  a  parallel  computer  with  shared  memory  the  approach  is  analogous.  The 
processors  have  to  be  synchronized  after  the  execution  of  the  instructions  of  each 
level. 
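
The allocation in Table 2 corresponds to splitting the instructions of one level into contiguous blocks, one block per processor, with a synchronization point before the next level starts. The Python sketch below illustrates only the splitting; the function name split_level and the block-wise distribution rule are assumptions made for this example and are not taken from GSPAR.

```python
def split_level(targets, n_procs):
    """Split the instructions of one level into n_procs contiguous blocks.

    Block sizes differ by at most one instruction.
    """
    base, extra = divmod(len(targets), n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        blocks.append(targets[start:start + size])
        start += size
    return blocks

# Level one of Table 1, identified by the indices of the updated elements.
level_one = [12, 9, 4, 5]
blocks = split_level(level_one, 2)
print(blocks)

# Levels two to six contain a single instruction each, so one of the
# two processors is idle there and only waits at the synchronization.
print(split_level([13], 2))
```

For two processors, the first block holds a(12) and a(9) and the second holds a(4) and a(5), as in Table 2. The later levels of the small example offer no parallelism, which is why the number of instructions per level matters for the achievable speedup.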

From  our  experiments  with  many  different  matrices  arising  from  the  process 
simulation  of  chemical  plants  and  the  circuit  simulation  respectively,  it  was  found 
that  the  number  of  levels  of  independence  is  small.  The  number  of  instructions 
in  the  first  two  levels  is  very  large,  in  the  next  four  to  six  levels  it  is  large  and 
finally  it  becomes  smaller  and  smaller. 
5 Numerical results

The numerical methods developed are implemented in the program package GSPAR.
GSPAR runs on workstations (Digital AlphaStation, IBM RS/6000,
SGI, Sun UltraSparc 1 and 2), vector computers (Cray J90, C90), parallel
computers with shared memory (Cray J90, C90, SGI Origin2000, Digital
AlphaServer) and parallel computers with distributed memory (Cray T3D).

The systems of linear equations considered arise from real-life problems in
the dynamic process simulation of chemical plants, in electric circuit
simulation and in the account of capital links (political sciences). The n × n
matrices A with |A| nonzero elements are described in Table 3.


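Table 3. Test matrices

name         discipline                       n       |A|
bayer01      chemical engineering        57 735   277 774
b_dyn        chemical engineering         1 089     4 264
bayer02      chemical engineering        13 935    63 679
bayer03      chemical engineering         6 747    56 196
bayer04      chemical engineering        20 545   159 082
bayer05      chemical engineering         3 268    27 836
bayer06      chemical engineering         3 008    27 576
bayer09      chemical engineering         3 083    21 216
bayer10      chemical engineering        13 436    94 926
advice3388   circuit simulation           3 388    40 545
advice3776   circuit simulation           3 776    27 590
cod2655_tr   circuit simulation           2 655    24 925
meg1         circuit simulation           2 904    58 142
meg4         circuit simulation           5 960    46 842
rlxADC_dc    circuit simulation           5 355    24 775
rlxADC_tr    circuit simulation           5 355    32 251
zy3315       circuit simulation           3 315    15 985
poli         account of capital links     4 008     8 188
poli_large   account of capital links    15 575    33 074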

In Table 4, results for the matrices in Table 3 are shown using the method
GSPAR on a DEC AlphaServer with an alpha EV5.6 (21164A) processor. Here,
# op LU is the number of operations (only multiplications and divisions) and
fill-in is the number of fill-ins during the factorization. The cpu time (in seconds)
for the first factorization, presented in strat., includes the times for the analysis
as well as for the numerical factorization. The cpu time for the generation of
the pseudo code is given in code. On the one hand, a dynamic ordering of the
columns can be applied during the pivoting. On the other hand, a minimum
degree ordering of A^T A (upper index *) or of A^T + A (upper index +) can be
used before the partial pivoting.

Some matrices, which are given in Harwell-Boeing format, and interesting details
about the matrices can be found in Tim Davis, University of Florida Sparse Matrix
Collection, http://www.cise.ufl.edu/~davis/sparse/


Table 4. GSPAR first factorization and generation of pseudo code

                        dynamic ordering                    minimum degree ordering
name            # op LU  fill-in  strat.   code       # op LU  fill-in  strat.   code
bayer01      10 032 621  643 898   35.18  12.72    13 860 173  812 505    5.75   9.95*
b_dyn            15 902    2 909    0.02   0           21 556    8 231    0.02   0.02*
bayer02       2 095 207  134 546    2.28   1.30     2 030 130  165 357    1.03   2.20*
bayer03       1 000 325   64 130    0.68   0.47       625 272   53 991    0.25   0.35*
bayer04       5 954 718  268 006    5.33   3.93     6 340 579  290 021    1.95   2.77*
bayer05         119 740   11 024    0.15   0.03       474 273   33 797    0.18   0.17*
bayer06       3 042 620   73 773    0.85   1.00     5 008 097  129 278    1.42   1.52*
bayer09         364 731   23 145    0.18   0.15       287 947   22 022    0.12   0.12*
bayer10       5 992 500  227 675    3.05   2.55     3 953 687  203 633    1.28   1.40*
advice3388      310 348    9 297    0.38   0.65       396 965    9 818    0.75   0.95
advice3776      355 465   25 656    0.35   0.75       382 224   26 074    0.62   0.98+
cod2655_tr    3 331 105  113 640    0.90   1.00     4 839 771  144 875    1.50   1.40+
meg1            796 797   40 436    0.32   0.40     1 245 847   59 558    0.48   0.78+
meg4            420 799   38 784    0.68   0.62       376 324   35 008    0.30   0.48+
rlxADC_dc        73 612    5 404    0.38   0.13        63 227    2 906    0.08   0.08+
rlxADC_tr       988 759   47 366    0.85   1.13     1 049 623   48 888    0.72   1.13+
zy3315           47 326    8 218    0.12   0.03        49 263    8 202    0.03   0.02+
poli              4 620      [?]    0.15   0              [?]       41    0.02   0*
poli_large       43 310   10 318    2.38   0.25        34 115      588    0.08   0.03+

The results in Table 4 show the following characteristics. For linear systems
arising from the process simulation of chemical plants, the analysis step with the
minimum degree ordering is in most cases, particularly for large systems, faster
than with the dynamic ordering, but the fill-in and the number of operations
for the factorization are larger. On the other hand, for systems arising from
circuit simulation, the factorization with the dynamic ordering is in most cases
faster than with the minimum degree ordering. The factorization with the minimum
degree ordering of A^T A is favourable for systems arising from chemical process
simulation, while an ordering of A^T + A is recommended for systems
arising from circuit simulation. The opposite choices of the minimum degree
ordering are unfavourable because the number of operations and the number of
fill-ins are very large.
In Table 5, cpu times (in seconds) for the second factorization are shown
for the linear solvers UMFPACK [4], SuperLU with minimum degree ordering
of A^T A (upper index *) or of A^T + A (upper index +) [5], Sparse [8] and
GSPAR with dynamical column ordering, using a DEC AlphaStation with an
alpha EV4.5 (21064) processor. In many applications, mainly in the numerical
simulation of physical and chemical problems, the analysis step including
ordering and first factorization is performed only a few times, but the second
factorization is performed often. Therefore the cpu time for the second
factorization is essential for the overall simulation time.


Table 5. Cpu times for second factorization

name         UMFPACK   SuperLU    Sparse   GSPAR
bayer01         5.02     6.70*      7.78    3.20
b_dyn           0.05     0.05*      0.07    0.00
bayer02         1.13     1.47*    10.433    0.55
bayer03         0.72     0.70*    17.467    0.27
bayer04         3.37     2.77*    187.88    1.70
bayer05         0.13     0.75*      0.08    0.05
bayer06         0.83     0.90*     54.33    0.82
bayer09         0.23     0.23*      3.57    0.10
bayer10         1.60     1.57*    379.75    1.65
advice3388      0.25     0.28+      0.15    0.10
advice3776      0.30     0.42+      0.20    0.10
cod2655_tr      0.30     0.55+      0.27    0.10
meg1            0.58     1.43+     13.95    0.22
meg4            0.37     0.75+      0.25    0.13
rlxADC_dc       0.15     0.18+      0.04    0.03
rlxADC_tr       0.40     0.90+      0.72    0.30
zy3315          0.15      [?]+      0.03    0.02
poli            0.03     0.07+      0.00    0.00
poli_large      0.13     0.27+      0.04    0.03

GSPAR achieves a fast second factorization for all linear systems in Table 5.
For linear systems with a large number of equations, GSPAR is at least two times
faster than UMFPACK, SuperLU and Sparse, respectively.

The cpu times for solving the triangular systems are one order of magnitude
smaller than the cpu times for the factorization. The proportions between the
different solvers are comparable to the results in Table 5.

The vector version of GSPAR has been compared with the frontal method
FAMP [12] on a Cray Y-MP8E vector computer using one processor. The
version of FAMP used is the routine from the commercial chemical process simulator




VECPAR'98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


SPEEDUP  ^  [1].  The  cpu  times  (in  seconds)  for  the  second  factorization  are 
shown  in  Table  6. 


Table 6. Cpu times for second factorization

name       FAMP    GSPAR
b_dyn      0.034   0.011
bayer09    0.162   0.082
bayer03    0.404   0.221
bayer02    0.683   0.421
bayer10    1.290   0.738
bayer04    2.209   0.983

GSPAR is at least two times faster than FAMP for these examples. The
proportions for solving the triangular systems are again the same.

For two large examples, the number of levels of independence is given in
Table 7, using GSPAR with two different orderings for pivoting. The algorithm
for lower triangular systems is called forward substitution and the analogous
algorithm for upper triangular systems is called back substitution.


Table 7. Number of levels of independence

example                    dynamical ordering   minimum degree ordering
bayer01   factorization                 3 077                     3 688
          forward sub.                  1 357                     1 562
          back substit.                 1 728                     2 476
bayer04   factorization                   876                       820
          forward sub.                    399                       338
          back substit.                   556                       495

In Table 8, wall-clock times (in seconds) are shown for the second
factorization, using GSPAR with different pivoting on a DEC AlphaServer with
four alpha EV5.6 (21164A) processors. The parallelization technique is based
on OpenMP [10]. The wall-clock times have been determined with the system
routine gettimeofday.

Table 8. Wall-clock times for second factorization

              dynamical ordering     minimum degree ordering
processors    bayer01    bayer04     bayer01    bayer04
    1           0.71       0.39        1.08       0.43
    2           0.54       0.27        0.75       0.29
    3           0.45       0.23        0.63       0.25
    4           0.49       0.24        0.70       0.30

In Table 9, the cpu times (in seconds) on a Cray T3D are given for the
second factorization, using GSPAR with dynamic ordering for pivoting. The linear
systems cannot be solved with fewer than four or sixteen processors, respectively,
because the processors of the T3D do not have enough local memory for the storage
of the pseudo code in these cases. The speedup factors are therefore set equal to
one for four or sixteen processors, respectively.

^ Used under licence 95122131717 for free academic use from Aspen Technology,
Cambridge, MA, USA; Release 5.5-5


Table 9. Cpu times for second factorization on Cray T3D

example   processors   cpu time   speedup factor
bayer04        4         1.59         1.00
               8         0.99         1.60
              16         0.60         2.65
              32         0.37         4.30
              64         0.24         6.63
bayer01       16         2.36         1.00
              32         1.45         1.63
              64         0.95         2.47
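
The speedup factors in Table 9 are relative to the smallest processor count on which each system fits, not to a single processor. The Python sketch below recomputes them from the published cpu times; the helper name speedups is hypothetical, and because the times are rounded to two decimals, the recomputed factors can differ from the table in the last digit.

```python
def speedups(times):
    """Return {p: t_base / t_p}, using the smallest processor count as base."""
    base = times[min(times)]
    return {p: base / t for p, t in sorted(times.items())}

# Cpu times (seconds) from Table 9, keyed by number of processors.
BAYER04 = {4: 1.59, 8: 0.99, 16: 0.60, 32: 0.37, 64: 0.24}
BAYER01 = {16: 2.36, 32: 1.45, 64: 0.95}

for name, times in (("bayer04", BAYER04), ("bayer01", BAYER01)):
    p_base = min(times)
    for p, s in speedups(times).items():
        # Efficiency relative to the base configuration, not to one processor.
        efficiency = s / (p / p_base)
        print(f"{name}: {p:3d} processors, speedup {s:.2f}, "
              f"relative efficiency {efficiency:.2f}")
```

The relative efficiency column is an addition for this sketch; it makes visible that doubling the processor count gives clearly less than a doubling in speed for both systems.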

6  Applications 


Problems of the dynamic process simulation of chemical plants can be modeled
by initial value problems for systems of differential-algebraic equations. The
numerical solution of these systems [3] involves the solution of large scale systems
of nonlinear equations, which can be solved with modified Newton methods.
The Newton corrections are found by solving large unsymmetric sparse systems
of linear equations. The overall computing time of the simulation problems is
often dominated by the time needed to solve the linear systems. In industrial
applications, the solution of sparse linear systems often requires more than 70 %
of the total simulation time. Thus a reduction of the linear system solution time
usually results in a significant reduction of the overall simulation time [13].
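
To illustrate why the linear solves dominate, the following Python sketch shows one common form of a modified Newton method, in which the Jacobian is factorized once and the factors are reused for every correction. It is a toy example under stated assumptions: the 2 x 2 nonlinear system, the starting point and all function names are invented for illustration, the LU factorization is dense and unpivoted, and nothing here is taken from SPEEDUP or GSPAR.

```python
def lu2(a):
    """LU factors of a 2x2 matrix without pivoting (assumes a[0][0] != 0)."""
    l21 = a[1][0] / a[0][0]
    u22 = a[1][1] - l21 * a[0][1]
    return l21, (a[0][0], a[0][1], u22)

def lu2_solve(factors, b):
    """Solve L U x = b by forward and back substitution."""
    l21, (u11, u12, u22) = factors
    y1, y2 = b[0], b[1] - l21 * b[0]
    x2 = y2 / u22
    x1 = (y1 - u12 * x2) / u11
    return x1, x2

def residual(x, y):
    """Toy nonlinear system: x^2 + y^2 = 4 and x*y = 1."""
    return (x * x + y * y - 4.0, x * y - 1.0)

def jacobian(x, y):
    return [[2.0 * x, 2.0 * y], [y, x]]

def modified_newton(x, y, tol=1e-12, max_iter=50):
    """Newton iteration with a Jacobian frozen at the starting point."""
    factors = lu2(jacobian(x, y))          # factorize once ...
    for k in range(max_iter):
        f1, f2 = residual(x, y)
        if max(abs(f1), abs(f2)) < tol:
            return x, y, k
        dx, dy = lu2_solve(factors, (-f1, -f2))   # ... reuse for each correction
        x, y = x + dx, y + dy
    raise RuntimeError("modified Newton iteration did not converge")

x, y, iterations = modified_newton(2.0, 0.5)
print(x, y, iterations)
```

Each iteration costs one pair of triangular solves, and a new factorization is needed only when the Jacobian is refreshed. In a simulation run this pattern is repeated many times with matrices of the same sparsity structure, which is the situation the second factorization is designed for.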
Table 10 shows three large scale industrial problems of Bayer AG, Leverkusen.
The number of differential-algebraic equations as well as an estimate
of the condition number of the matrices of the linear systems are given. The
condition numbers are very large, which is typical for industrial applications in
this field.


Table 10. Large scale industrial problems

name      chemical plants                     equations   condition numbers
bayer04   nitration plant                         3 268   2.95E+26, 1.4E+27
bayer10   distillation column                    13 436   1.4E+15
bayer01   five coupled distillation columns      57 735   6.0E+18, 6.96E+18

The problems have been solved on a Cray C90 vector computer using the
chemical process simulator SPEEDUP [1]. In SPEEDUP, the vector versions of
the linear solvers FAMP and GSPAR have been used alternatively. The cpu times
(in seconds) for complete dynamic simulation runs are shown in Table 11.


Table 11. Cpu time for complete dynamic simulation

name       FAMP    GSPAR   in %
bayer04    451.7   283.7   62.8
bayer10    380.9   254.7   66.9

For the large plant bayer01, benchmark tests have been performed on a
dedicated Cray J90 computer, using the simulator SPEEDUP with the solvers FAMP
and GSPAR alternatively. The results are given in Table 12.


Table 12. Benchmark tests

time                FAMP      GSPAR   in %
cpu time         6 066.4    5 565.8   91.7
wall-clock time  6 697.9    5 797.1   86.5

The simulation of plant bayer01 has also been performed on a Cray C90 vector
computer connected with a Cray T3D parallel computer, using SPEEDUP
and the parallel version of GSPAR. Here, the linear systems have been solved
on the parallel computer while the other parts of the algorithms of SPEEDUP
have been performed on the vector computer. GSPAR needs 1 440.5 seconds cpu
time on a T3D with 64 processors. When executed on the Cray C90 only,
2 490 seconds are needed for the total simulation.
Abstract. The Polar project has the aim of designing a parallel, ODMG
compatible  object  database  server.  This  paper  describes  the  server  requirements 
and  investigates  issues  in  designing  a  system  to  achieve  them.  We  believe  that  it 
is  important  to  build  on  experience  gained  in  the  design  and  usage  of  parallel 
relational  database  systems  over  the  last  ten  years,  as  much  is  also  relevant  to 
parallel  object  database  systems.  Therefore  we  present  an  overview  of  the 
design  of  parallel  relational  database  servers  and  investigate  how  their  design 
choices  could  be  adopted  for  a  parallel  object  database  server.  We  conclude  that 
while  there  are  many  similarities  in  the  requirements  and  design  options  for 
these  two  types  of  parallel  database  servers,  there  are  a  number  of  significant 
differences,  particularly  in  the  areas  of  object  access  and  method  execution. 


1  Introduction 

The  parallel  database  server  has  become  the  “killer  app”  of  parallel  computing.  The 
commercial  market  for  these  systems  is  now  significantly  larger  than  that  for  parallel 
systems  running  numeric  applications,  making  them  mainstream  IT  system 
components  offered  by  a  number  of  major  computer  vendors.  They  can  provide  high 
performance,  high  availability,  and  high  storage  capacity,  and  it  is  this  combination  of 
attributes  which  has  allowed  them  to  meet  the  growing  requirements  of  the  increasing 
number  of  computer  system  users  who  need  to  store  and  access  large  amounts  of 
information. 

There  are  a  number  of  reasons  to  explain  the  rapid  rise  of  parallel  database  servers, 
including: 

•  they  offer  higher  performance  at  a  better  cost-performance  ratio  than  do  the 
previous  dominant  systems  in  their  market  -  mainframe  computers. 

•  designers  have  been  able  to  produce  highly  available  systems  by  exploiting  the 
natural  redundancy  of  components  in  parallel  systems.  High  availability  is 
important  because  many  of  these  systems  are  used  for  business-critical  applications 
in  which  the  financial  performance  of  the  business  is  compromised  if  the  data 
becomes  inaccessible. 
• the 1990s have seen a process of re-centralisation of computer systems following
the trend towards downsizing in the 1980s. The reasons for this include: cost
savings  (particularly  in  software  and  system  management),  regaining  central 
control  over  information,  improving  data  integrity  by  reducing  duplication,  and 
increasing  access  to  information.  This  process  has  created  the  demand  for  powerful 
information  servers. 

•  there  has  been  a  realisation  that  many  organisations  can  derive  and  infer  valuable 
information  from  the  data  held  in  their  databases.  This  has  led  to  the  use  of 
techniques,  such  as  data-mining,  which  can  place  additional  load  on  the  database 
server  from  which  the  base  data  on  which  they  operate  must  be  accessed. 

•  the  growing  use  of  the  Internet  and  intranets  as  ways  of  making  information 
available  both  outside  and  inside  organisations  has  increased  the  need  for  systems 
which  can  make  large  quantities  of  data  available  to  large  numbers  of  simultaneous 
users. 

The  growing  importance  of  parallel  database  servers  is  reflected  in  the  design  of 
commercial  parallel  platforms.  Efficient  support  for  parallel  database  servers  is  now  a 
key  design  requirement  for  the  majority  of  parallel  systems. 

To  date,  almost  all  parallel  database  servers  have  been  designed  to  support 
relational database management systems (RDBMS) [1]. A major factor which has
simplified,  and  so  encouraged,  the  deployment  of  parallel  RDBMS  by  organisations  is 
their  structure.  Relational  database  systems  have  a  client-server  architecture  in  which 
client  applications  can  only  access  the  server  through  a  single  restricted  and  well 
defined  query  interface.  To  access  data,  clients  must  send  an  SQL  (Structured  Query 
Language)  query  to  the  server  where  it  is  compiled  and  executed.  This  architecture 
allows  a  serial  server  which  is  not  able  to  handle  the  workload  generated  by  a  set  of 
clients  to  be  replaced  by  a  parallel  server  with  higher  performance.  The  client 
applications  are  unchanged:  they  still  send  the  same  SQL  to  the  parallel  server  as  they 
did  to  the  serial  server  because  the  exploitation  of  parallelism  is  completely  internal  to 
the  server. 

Existing  parallel  relational  database  servers  exploit  two  major  types  of  parallelism. 
Inter-query  parallelism  is  concerned  with  the  simultaneous  execution  of  a  set  of 
queries.  It  is  typically  used  for  On-Line  Transaction  Processing  (OLTP)  workloads  in 
which  the  server  processes  a  continuous  stream  of  small  transactions  generated  by  a 
set  of  clients.  Intra-Query  parallelism  is  concerned  with  exploiting  parallelism  within 
the  execution  of  single  queries  so  as  to  reduce  their  response  time. 
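
Intra-query parallelism can be pictured as running the same operator on every partition of a table and then merging the partial results. The Python sketch below is a minimal illustration under several assumptions: the table contents, the aggregate query and the function names are hypothetical, and a thread pool merely stands in for the nodes of a parallel server (CPython threads do not provide real CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of (id, amount) rows, already split into three partitions.
PARTITIONS = [
    [(1, 10.0), (4, 25.0)],
    [(2, 40.0), (5, 5.0)],
    [(3, 30.0)],
]

def scan_partition(rows, threshold):
    """Local part of: SELECT SUM(amount) FROM t WHERE amount >= threshold."""
    return sum(amount for _, amount in rows if amount >= threshold)

def parallel_sum(partitions, threshold):
    """Run the scan on every partition concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda rows: scan_partition(rows, threshold),
                            partitions)
        return sum(partials)

total = parallel_sum(PARTITIONS, 20.0)
print(total)
```

Inter-query parallelism would instead submit many independent queries to the pool, each of them executed serially.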

Despite  the  current  market  dominance  of  relational  database  servers,  there  is  a 
growing  belief  that  relational  databases  are  not  ideal  for  a  number  of  types  of 
applications,  and  in  recent  years  there  has  been  a  growth  of  interest  in  object  oriented 
databases  which  are  able  to  overcome  many  of  the  problems  inherent  in  relational 
systems  [2].  In  particular,  the  growth  in  the  use  of  object  oriented  programming 
languages, such as C++ and Java, coupled with the increasing importance of
object-based distributed systems has promoted the use of object database management
systems (ODBMS) as key system components for object storage and access. A
consequence of the interest in object databases has been an attempt to define a
standard specification for all the key ODBMS interfaces - the Object Database
Management Group ODMG 2.0 standard [3].

The  Polar  project  has  the  aim  of  designing  and  implementing  a  prototype  ODMG 
compatible  parallel  object  database  server.  Restricting  the  scope  of  the  project  to  this 
standard  allows  us  to  ignore  many  issues,  such  as  query  language  design  and 
programming  language  interfaces,  and  instead  focus  directly  on  methods  for  exploiting 
parallelism  within  the  framework  imposed  by  the  standard. 

In  this  paper  we  describe  the  requirements  that  the  Polar  server  must  meet  and 
investigate  issues  in  designing  a  system  to  achieve  them.  Rather  than  design  the 
system  in  isolation,  we  believe  that  it  is  important  to  build,  where  possible,  on  the 
extensive  experience  gained  over  the  last  ten  years  in  the  design  and  usage  of  parallel 
relational  database  systems.  However,  as  we  describe  in  this  paper,  differences 
between  the  object  and  relational  database  paradigms  result  in  significant  differences 
in  some  areas  of  the  design  of  parallel  servers  to  support  them.  Those  differences  in 
the  paradigms  which  have  most  impact  on  the  design  are: 

•  objects  in  an  object  database  can  be  referenced  by  a  unique  identifier,  or 
(indirectly)  as  members  of  a  collection.  In  contrast,  tables  (collections  of  rows)  are 
the  only  entities  which  can  be  referenced  by  a  client  of  a  relational  database  (i.e. 
individual  table  rows  cannot  be  directly  referenced). 

• there are two ways to access data held in an object database: through a query
language  (OQL),  and  by  directly  mapping  database  objects  into  client  application 
program  objects.  In  a  relational  database,  the  query  language  (SQL)  is  the  only  way 
to  access  data. 

•  objects  in  an  object  database  can  have  associated  user-defined  methods  which  may 
be  called  within  queries,  and  by  client  applications  which  have  mapped  database 
objects  into  program  objects.  In  a  relational  database,  there  is  no  equivalent  of  user- 
defined  methods:  only  a  fixed  set  of  operations  is  provided  by  SQL. 
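
The three differences listed above can be made concrete with a toy comparison. The Python sketch below is only an illustration: the Account class, the object identifiers, the in-memory dictionary standing in for an object store and the select_balance helper standing in for an SQL query are all hypothetical, and none of them corresponds to an ODMG or SQL interface.

```python
from dataclasses import dataclass

@dataclass
class Account:
    """A hypothetical persistent object with a user-defined method."""
    oid: int
    owner: str
    balance: float

    def interest(self, rate):
        return self.balance * rate

# Object style: individual objects are reachable through a unique identifier.
object_store = {
    oid: Account(oid, owner, balance)
    for oid, owner, balance in [(101, "ann", 200.0), (102, "bob", 50.0)]
}

# Relational style: a table is a collection of rows, reached only by a query.
accounts_table = [("ann", 200.0), ("bob", 50.0)]

def select_balance(table, owner):
    """Stand-in for: SELECT balance FROM accounts WHERE owner = ?"""
    return [balance for name, balance in table if name == owner]

account = object_store[101]                    # direct reference by identifier
print(account.interest(0.05))                  # user-defined method on the object
print(select_balance(accounts_table, "ann"))   # query is the only access path
```

For a parallel server, the first access path means that a client may ask for any single object at any time, and the method call means that user code may have to run wherever the object is held.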

The  structure  of  the  rest  of  this  paper  is  as  follows.  In  Section  2  we  give  an  overview 
of  the  design  of  parallel  relational  database  servers,  based  on  our  experiences  in  two 
previous parallel database server projects: EDS [4] and Goldrush [1]. Next, in Section
3  we  define  our  requirements  for  the  Polar  parallel  ODBMS.  Some  of  these  are 
identical  to  those  of  parallel  RDBMS;  others  have  emerged  from  experience  of  the 
limitations  of  existing  parallel  servers;  while  others  are  derived  from  our  view  of  the 
potential  use  of  parallel  ODBMS  as  components  in  distributed  systems.  Based  on 
these  requirements,  in  Section  4,  we  present  an  overview  of  issues  in  the  design  of  a 
parallel  ODBMS.  This  allows  us  to  highlight  those  areas  in  the  design  of  a  parallel 
object  database  server  where  it  is  possible  to  adopt  solutions  based  on  parallel 
RDBMS  or  serial  ODBMS,  and,  in  contrast,  those  areas  where  new  solutions  are 
required.  Finally,  in  Section  5  we  draw  conclusions  from  our  investigations,  and  point 
to  further  work. 
1.1 Related Work

There  have  been  a  number  of  parallel  relational  database  systems  described  in  the 
literature.  Those  which  have  most  influenced  this  paper  are  the  two  on  which  we 
previously  worked:  EDS  and  Goldrush. 

The  EDS  project  [4]  designed  and  implemented  a  complete  parallel  database 
server,  including  hardware,  operating  system  and  database.  The  database  itself  was 
basically  relational  though  there  were  some  extensions  to  provide  support  for  objects. 
The  Goldrush  project  within  ICL  High  Performance  Systems  [1,  5,  6]  designed  a 
parallel  relational  database  server  product  running  Parallel  Oracle. 

There  is  extensive  coverage  in  the  literature  of  research  into  the  design  of  serial 
ODBMS. One of the most complete is the description of the O2 system [7].

Recently,  there  has  been  some  research  into  the  design  of  parallel  ODBMS.  For 
example, Goblin [8] is a parallel ODBMS. However, unlike the system described in
this  paper,  it  is  limited  to  a  main-memory  database. 

Work  on  object  servers  such  as  Shore  [9]  and  Thor  [10]  is  also  of  relevance  as  this 
is  a  key  component  of  any  parallel  ODBMS.  We  will  refer  to  this  work  at  appropriate 
points  in  the  body  of  the  paper. 


2  The  Design  of  Parallel  Relational  Database  Servers 

Parallel  relational  database  servers  have  been  designed,  implemented  and  utilised  for 
over  ten  years,  and  it  is  important  that  this  experience  is  used  to  inform  the  design  of 
parallel  object  database  servers.  Therefore,  in  this  section  we  give  an  overview  of  the 
design  of  parallel  relational  database  servers.  This  will  then  allow  us  to  highlight 
commonalities and differences in the requirements (Section 3) and design options
(Section 4) between the two types of parallel database servers. The rest of this section
is structured as follows. We begin by describing the architectures of parallel platforms
(hardware and operating system) designed to support parallel database servers. Next,
we describe methods for exploiting parallelism found in relational database workloads.
Throughout this section we will draw on examples from our experience in the design
and use of the ICL Goldrush MegaServer [1].


2.1  Parallel  Platforms 

In  this  section,  we  describe  the  design  of  parallel  platforms  (which  we  define  as 
comprising  the  hardware  and  operating  system)  to  meet  the  requirements  of  database 
servers. 

We  are  interested  in  systems  utilising  the  highly  scaleable  distributed  memory 
parallel  hardware  architecture  [5]  in  which  a  set  of  computing  nodes  are  connected  by 
a  high  performance  network  (Fig.  1).  Each  node  has  the  architecture  of  a  uniprocessor 
or  shared  store  multiprocessor  -  one  or  more  CPUs  share  the  local  main  memory  over 
a  bus.  Typically,  each  node  also  has  a  set  of  locally  connected  disks  for  the  persistent 
storage of data. Connecting disks to each node allows the system to provide both high
IO performance, by supporting parallel disk access, and high storage capacity. In most
current systems the disks are arranged in a share nothing configuration - each disk is
physically connected to only one node. As will be seen, this has major implications for
the design of the parallel database server software. It is likely that in future the
availability of high bandwidth peripheral interconnects will lead to the design of
platforms in which each disk is physically connected to more than one node [11];
however, this 'shared disk' configuration is not discussed further in this paper as we
focus on currently prevalent hardware platform technology. Database servers require
large  main  memory  caches  for  efficient  performance  (so  as  to  reduce  the  number  of 
disk  accesses)  and  so  the  main  memories  tend  to  be  large  (currently  0.25-4GB  is 
typical).  Some  of  the  nodes  in  a  parallel  platform  will  have  external  network 
connections  to  which  clients  are  connected.  External  database  clients  send  database 
queries  to  the  server  through  these  connections  and  later  receive  the  results  via  them. 
The  number  of  these  nodes  (which  we  term  Communications  Nodes)  varies  depending 
on  the  required  performance  and  availability.  In  terms  of  performance,  it  is  important 
that  there  are  enough  Communications  Nodes  to  perform  client  communication 
without  it  becoming  a  bottleneck,  while  for  availability  it  is  important  to  have  more 
than  one  route  from  a  client  to  a  parallel  server  so  that  if  one  fails,  another  is  available. 


Fig.  1.  A  Distributed  Memory  Parallel  Architecture 

The  distributed  memory  parallel  hardware  architecture  is  highly  scaleable  as  adding 
a  node  to  a  system  increases  all  the  key  performance  parameters  including:  processing 
power,  disk  throughput  and  capacity,  and  main  memory  bandwidth  and  capacity.  For  a 
database  server,  this  translates  into  increased  query  processing  power,  greater  database 
capacity,  higher  disk  throughput  and  a  larger  database  cache. 

The  design  of  the  hardware  must  also  contribute  to  the  creation  of  a  high 
availability  system.  Methods  of  achieving  this  include  providing  component 
redundancy  and  the  ability  to  replace  failed  components  through  hot-pull  &  push 
techniques  without  having  to  take  the  system  down  [5]. 




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


The  requirement  to  support  a  database  server  also  influences  the  design  of  the 
operating  system  in  a  number  of  ways.  Firstly,  commercial  database  server  software 
depends  on  the  availability  of  a  relatively  complete  set  of  standard  operating  system 
facilities, including support for: file access, processes (with the associated
inter-process communications) and external communications to clients through standard
protocols.  Secondly,  it  is  important  that  the  cost  of  inter-node  communications  is 
minimised  as  this  directly  affects  the  performance  of  a  set  of  key  functions  including 
remote  data  access  and  query  execution.  Finally,  the  operating  system  must  be 
designed  to  contribute  to  the  construction  of  a  high  availability  system.  This  includes 
ensuring  that  the  failure  of  one  node  does  not  cause  other  nodes  to  fail,  or  unduly 
affect  their  performance  (for  example  through  having  key  services  such  as  distributed 
file  systems  delayed  for  a  significant  time  waiting,  until  a  long  time-out  occurs,  for  a 
response  from  the  failed  node). 

Experience  in  a  number  of  projects  suggests  that  micro-kernel  based  operating 
systems  provide  better  support  than  do  monolithic  kernels  for  adding  and  modifying 
the  functionality  of  the  kernel  to  meet  the  requirements  of  parallel  database  servers 
[1].  This  is  for  two  reasons:  they  provide  a  structured  way  in  which  new  services  can 
be  added,  and  their  modularity  simplifies  the  task  of  modifying  existing  services,  for 
example  to  support  high-availability  [4]. 


2.2  Exploiting  Parallelism 

There  are  two  main  schemes  utilised  by  parallel  relational  database  servers  for  query 
processing  on  distributed  memory  parallel  platforms.  These  are  usually  termed  Task 
Shipping  and  Data  Shipping.  They  are  described  and  compared  in  this  subsection  so 
that  in  Section  4  we  can  discuss  their  appropriateness  to  the  design  of  parallel  object 
database  servers. 

In  both  Task  and  Data  shipping  schemes,  the  database  tables  are  horizontally 
partitioned  across  a  set  of  disks  (i.e.  each  disk  stores  a  set  of  rows).  The  set  of  disks  is 
selected  to  be  distributed  over  a  set  of  nodes  (e.g.  it  may  consist  of  one  disk  per  node 
of  the  server).  The  effect  is  that  the  accesses  to  a  table  are  spread  over  a  set  of  disks 
and  nodes,  so  giving  greater  aggregate  throughput  than  if  the  table  was  stored  on  a 
single  disk.  This  also  ensures  that  the  table  size  is  not  limited  to  the  capacity  of  a 
single  disk. 
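
The partitioning described above can be illustrated with a minimal sketch. It assumes a simple round-robin placement of rows; the text does not say which partitioning function a real server uses, and the names `partition_rows`, `disks` and `table` are hypothetical.

```python
def partition_rows(rows, disks):
    """Horizontally partition rows over disks (round-robin is an assumption)."""
    placement = {disk: [] for disk in disks}
    for i, row in enumerate(rows):
        # Row i goes to disk i mod (number of disks), spreading accesses evenly.
        placement[disks[i % len(disks)]].append(row)
    return placement


# One disk per node, as in the example given in the text.
disks = [("node0", "disk0"), ("node1", "disk0"), ("node2", "disk0")]
table = [{"id": i} for i in range(7)]
placement = partition_rows(table, disks)
```

Each disk then holds only a subset of the rows, so a scan of the whole table can draw on the throughput of all three disks, and the table may be larger than any one of them.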

In  both  schemes,  each  node  runs  a  database  server  which  can  compile  and  execute  a 
query  or,  as  will  be  described,  execute  part  of  a  parallelised  query.  When  a  client 
sends  an  SQL  query  to  one  of  the  Communications  Nodes  of  the  parallel  server  it  is 
directed  to  a  node  where  it  is  compiled  and  executed  either  serially,  on  a  single  node, 
or  in  parallel,  across  a  set  of  nodes.  The  difference  between  the  two  schemes  is  in  the 
method  of  accessing  database  tables  as  is  now  described. 

2.2.1  Data  Shipping

In  the  data  shipping  scheme  the  parallel  server  runs  a  distributed  filesystem  which 
allows  each  node  to  access  data  from  any  disk  in  the  system.  Therefore,  a  query 
running  on  a  node  can  access  any  row  of  any  table,  irrespective  of  the  physical 
location  of  the  disk  on  which  the  row  is  stored. 

When  inter-query  parallelism  is  exploited,  a  set  of  external  clients  generate  queries 
which  are  sent  to  the  parallel  server  for  execution.  On  arrival,  the  Communications 
Node  forwards  them  to  another  node  for  compilation  and  execution.  Therefore  the 
Communication  Nodes  need  to  use  a  load  balancing  algorithm  to  select  a  node  for  the 
compilation  and  execution  of  each  query.  Usually,  queries  in  OLTP  workloads  are  too 
small  to  justify  parallelisation  and  so  each  query  is  executed  only  on  one  node.  When  a 
query  is  executed,  the  database  server  accesses  the  required  data  in  the  form  of  table 
rows.  Each  node  holds  a  database  cache  in  main  memory  and  so  if  a  row  is  already  in 
the  local  cache  it  can  be  accessed  immediately.  However,  if  it  is  not  in  the  cache  then 
the  distributed  filesystem  is  used  to  fetch  it  from  disk.  The  unit  of  transfer  between  the 
disk  and  cache  is  a  page  of  rows.  When  query  execution  is  complete  the  result  is 
returned  to  the  client. 
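
As an illustration of the load balancing step, the sketch below assumes the simplest possible policy: each query goes to the node currently running the fewest queries. The policy and the names `choose_node`, `dispatch` and `load` are assumptions made for this example; the text does not specify the algorithm used by the Communications Nodes.

```python
def choose_node(active_queries):
    """Return the node running the fewest queries (ties broken by node name)."""
    return min(active_queries, key=lambda node: (active_queries[node], node))


def dispatch(query, active_queries):
    """Pick a node for the query and record that it is now running there.

    The query itself is not inspected in this sketch.
    """
    node = choose_node(active_queries)
    active_queries[node] += 1
    return node


load = {"node0": 2, "node1": 0, "node2": 1}
first = dispatch("Q1", load)
second = dispatch("Q2", load)
third = dispatch("Q3", load)
```

A production policy would probably also weigh cache contents, since directing related queries to the same node improves cache hit rates.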

When  a  single  complex  query  is  to  be  executed  in  parallel  (intra-query  parallelism) 
then  parallelism  is  exploited  within  the  standard  operators  used  to  execute  queries: 
scan,  join  and  sort  [12].  For  example,  if  a  table  has  to  be  scanned  to  select  rows  which 
meet  a  particular  criterion  then  this  can  be  done  in  parallel  by  having  each  of  a  set  of
nodes  scan  a  part  of  the  table.  The  results  from  each  node  are  appended  to  produce  the 
final  result.  Another  type  of  parallelism,  pipeline  parallelism,  can  be  exploited  in 
queries  which  require  multiple  levels  of  operators  by  streaming  the  results  from  one 
operator  to  the  inputs  of  others. 
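
The parallel scan can be sketched as follows. Threads stand in for server nodes here, which is an assumption made purely for illustration, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """Scan one part of the table, keeping the rows that meet the criterion."""
    return [row for row in rows if predicate(row)]


def parallel_scan(partitions, predicate):
    """Scan every partition concurrently and append the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = pool.map(lambda part: scan_partition(part, predicate), partitions)
        result = []
        for rows in partial:  # map() yields results in partition order
            result.extend(rows)
    return result


partitions = [[1, 8, 3], [10, 2], [7, 12]]
selected = parallel_scan(partitions, lambda value: value > 6)
```

Because the partial results are simply appended, no coordination between the scanning workers is needed until the final step.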

Because,  in  the  Data  Shipping  model,  any  node  can  access  any  row  of  any  table  the 
compiler  is  free  to  decide  for  each  query  both  how  much  parallelism  it  is  sensible  to 
exploit,  and  the  nodes  on  which  it  should  be  executed.  Criteria  for  making  these 
decisions  include  the  granularity  of  parallelism  and  the  current  loading  of  the  server, 
but  it  is  important  to  note  that  these  decisions  are  not  constrained  by  the  location  of  the 
data  on  which  the  query  operates.  For  example,  even  if  a  table  is  only  partitioned  over 
the  disks  of  three  nodes,  a  compiler  can  still  choose  to  execute  a  scan  of  that  table  over 
eight  nodes. 

The  fact  that  any  node  can  access  data  from  disks  located  anywhere  in  the  parallel 
system  has  a  number  of  major  implications  for  the  design  of  the  parallel  database 
server: 

•  the  distributed  filesystem  must  meet  a  number  of  requirements  which  are  not  found 
in  typical,  conventional  distributed  filesystems  such  as  NFS.  These  include  high 
performance  access  to  both  local  and  remote  disks,  continuous  access  to  data  even 
in  the  presence  of  disk  and  node  failures,  and  the  ability  to  perform  synchronous 
writes  to  remote  disks.  Consequently,  considerable  effort  has  been  expended  in  this 
area  by  developers  of  parallel  database  platforms. 

•  a  global  lock  manager  is  required  as  a  resource  shared  by  the  entire  parallel  system. 
A  table  row  can  be  accessed  by  a  query  running  on  any  node  of  the  system. 
Therefore  the  lock  which  protects  that  row  must  be  managed  by  a  single  component
in  the  system  so  that  queries  running  on  different  nodes  of  the  system  see  the  same 
lock  state.  Consequently,  any  node  which  wishes  to  take  a  lock  must  communicate 
with  the  single  component  that  manages  that  lock.  If  all  locks  in  a  system  were  to 
be  managed  by  a  single  component  -  a  centralised  lock  manager  -  then  the  node  on 
which  that  component  ran  would  become  a  bottleneck,  reducing  the  system’s 
scalability.  Therefore,  parallel  database  systems  usually  implement  the  lock 
manager  in  a  distributed  manner.  Each  node  runs  an  instance  of  the  lock  manager 
which  manages  a  subset  of  the  locks  [1].  When  a  query  needs  to  acquire  or  drop  a 
lock  it  sends  a  message  to  the  node  whose  lock  manager  instance  manages  that 
lock.  One  method  of  determining  the  node  which  manages  a  lock  is  to  use  a  hash 
function  to  map  a  lock  identifier  to  a  node.  All  the  lock  manager  instances  need  to 
co-operate  and  communicate  to  determine  deadlocks,  as  circles  of  dependencies 
can  contain  a  set  of  locks  managed  by  more  than  one  lock  manager  instance  [13]. 

•  distributed  cache  management  is  required.  A  row  can  be  accessed  by  more  than  one 
node  and  this  raises  the  issue  that  a  row  could  be  cached  in  more  than  one  node  at 
the  same  time.  It  is  therefore  necessary  to  have  a  scheme  for  maintaining  cache 
coherency.  One  solution  is  the  use  of  cache  locks  managed  by  the  lock  manager. 
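
The hash-based placement of locks mentioned in the second point above might look like the following sketch. It assumes exclusive locks only and uses CRC-32 as the hash function; both choices, and all of the names, are illustrative assumptions rather than the design of any particular system. Deadlock detection is omitted.

```python
import zlib


def lock_home_node(lock_id, num_nodes):
    """Map a lock identifier to the node whose lock manager instance owns it."""
    return zlib.crc32(lock_id.encode("utf-8")) % num_nodes


class LockManagerInstance:
    """Manages the subset of (exclusive) locks hashed to one node."""

    def __init__(self):
        self.holders = {}  # lock_id -> id of the transaction holding the lock

    def acquire(self, lock_id, txn):
        holder = self.holders.get(lock_id)
        if holder is not None and holder != txn:
            return False  # held by another transaction
        self.holders[lock_id] = txn
        return True

    def release(self, lock_id, txn):
        if self.holders.get(lock_id) == txn:
            del self.holders[lock_id]


managers = [LockManagerInstance() for _ in range(4)]


def acquire(lock_id, txn):
    """Route the request to the instance that manages this lock."""
    return managers[lock_home_node(lock_id, len(managers))].acquire(lock_id, txn)


def release(lock_id, txn):
    managers[lock_home_node(lock_id, len(managers))].release(lock_id, txn)
```

Every node computes the same home node for a given lock identifier, so all queries see a single lock state without a centralised lock manager.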

The  efficiency  of  database  query  execution  in  a  Data  Shipping  system  is  highly 
dependent  on  the  cache  hit  rate.  A  cache  miss  requires  a  disk  access  and  the  execution 
of  filesystem  code  on  two  nodes  (assuming  the  disk  is  remote),  which  increases 
response  time  and  reduces  throughput.  Two  techniques  can  be  used  to  reduce  the 
resulting  performance  degradation.  The  first  is  to,  where  possible,  direct  queries  which 
access  the  same  data  to  the  same  node.  For  example  all  banking  queries  from  one 
branch  could  be  directed  to  the  same  node.  This  increases  cache  hit  rates  and  reduces 
pinging:  the  process  by  which  cached  pages  have  to  be  moved  around  the  caches  of 
different  nodes  of  the  system  because  the  data  contained  in  them  is  updated  by  queries 
running  on  more  than  one  node.  The  second  technique  extends  the  first  technique  so 
that  not  only  are  queries  operating  on  the  same  data  sent  to  the  same  node,  but  that 
node  is  selected  as  the  one  which  holds  the  data  on  its  local  disks.  The  result  is  that  if 
there  is  a  cache  miss  then  only  a  local  disk  access  is  required.  This  not  only  reduces 
latency,  but  also  increases  throughput  as  less  code  is  executed  in  the  filesystem  for  a 
local  access  than  a  remote  access.  This  second  technique  is  closely  related  to  Task 
Shipping,  which  is  now  described. 


2.2.2  Task  Shipping 

The  key  difference  between  the  two  Shipping  schemes  is  that  in  Task  Shipping  only 
those  parts  of  tables  stored  on  local  disks  can  be  accessed  during  query  evaluation. 
Consequently,  each  query  must  be  decomposed  into  sub-queries,  each  of  which  only 
requires  access  to  the  table  rows  stored  on  a  single  node.  These  sub-queries  are  then 
shipped  to  the  appropriate  nodes  for  execution.  For  a  parallel  scan  operation,  this  is 
straightforward  to  organise,  though  it  does  mean  that,  unlike  in  the  case  of  Data 
Shipping,  the  parallel  speed-up  is  limited  by  the  number  of  nodes  across  which  the 
table  is  partitioned,  unless  the  data  is  temporarily  re-partitioned  by  the  query.  When  a
join  is  performed,  if  the  two  tables  to  be  joined  have  been  partitioned  across  the  nodes 
such  that  rows  with  matching  values  of  the  join  attribute  are  stored  on  disks  of  the 
same  node,  then  the  join  can  straightforwardly  be  carried  out  as  a  parallel  set  of  local 
joins,  one  on  each  node.  However,  if  this  is  not  the  case,  then  at  least  one  of  the  tables 
will  have  to  be  re-partitioned  before  the  join  can  be  carried  out.  This  is  achieved  by 
sending  each  row  of  the  table(s)  to  the  node  in  which  it  will  participate  in  a  local  join. 
Parallel  sorting  algorithms  also  require  this  type  of  redistribution  of  rows,  and  so  it 
may  be  said  that  the  term  Task  Shipping  is  misleading  because  in  common  situations  it 
is  necessary  to  ship  data  around  the  parallel  system.  However,  because  data  is  only
ever  accessed  from  disk  on  one  node  (the  node  connected  to  the  disk  on  which  it  is
stored),  some  aspects  of  the  system  design  are  simplified  when  compared  to  the
Data  Shipping  scheme,  viz.:

•  there  is  no  requirement  for  a  distributed  filesystem  because  data  is  always  accessed 
from  local  disk  (however,  in  order  that  the  system  be  resilient  to  node  failure,  it  will 
still  be  necessary  to  duplicate  data  on  a  remote  node  or  nodes  -  see  Section  2.3). 

•  there  is  no  requirement  for  global  cache  coherency  because  data  is  only  cached  on 
one  node  -  that  connected  to  the  disk  on  which  it  is  stored. 

•  lock  management  is  localised.  All  the  accesses  to  a  particular  piece  of  data  are 
made  only  from  one  node  -  the  node  connected  to  the  disk  on  which  it  is  stored. 
This  can  therefore  run  a  local  lock  manager  responsible  only  for  the  local  table 
rows.  However,  a  2-phase  commitment  protocol  is  still  required  across  the  server  to 
allow  transactions  which  have  been  fragmented  across  a  set  of  nodes  to  commit  in  a 
co-ordinated  manner.  Further,  global  deadlock  detection  is  still  required. 
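
The re-partitioning step needed before a join, described earlier in this subsection, can be sketched as below. The sketch assumes hash partitioning on an integer join attribute, and a sequential loop stands in for the per-node local joins; the function names and sample data are hypothetical.

```python
from collections import defaultdict


def repartition(partitions, key, num_nodes):
    """Send each row to the node chosen by hashing its join attribute."""
    out = [[] for _ in range(num_nodes)]
    for part in partitions:
        for row in part:
            out[hash(row[key]) % num_nodes].append(row)
    return out


def local_join(left_rows, right_rows, key):
    """Hash join of the rows that ended up on one node."""
    index = defaultdict(list)
    for row in left_rows:
        index[row[key]].append(row)
    return [{**l, **r} for r in right_rows for l in index[r[key]]]


def parallel_join(left_parts, right_parts, key, num_nodes):
    left = repartition(left_parts, key, num_nodes)
    right = repartition(right_parts, key, num_nodes)
    result = []
    for node in range(num_nodes):  # each iteration models one node's local join
        result.extend(local_join(left[node], right[node], key))
    return result


customers = [[{"cust": 1, "name": "A"}], [{"cust": 2, "name": "B"}]]
orders = [[{"cust": 2, "item": "x"}, {"cust": 1, "item": "y"}],
          [{"cust": 2, "item": "z"}]]
joined = parallel_join(customers, orders, "cust", 2)
```

After re-partitioning, rows with matching join values are on the same node, so no node needs to read data held on another node's disks during the join itself.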

An  implication  of  the  Task  Shipping  scheme  is  that  even  small  queries  which  contain 
little  or  no  parallelism  still  have  to  be  divided  into  sub-queries  if  they  access  a  set  of 
table  rows  which  are  not  all  stored  on  one  node.  This  will  be  inefficient  if  the
sub-queries  have  low  granularity.


2.3  Availability 

If  a  system  is  to  be  highly  available  then  its  components  must  be  designed  to 
contribute  to  this  goal.  In  this  section  we  discuss  how  the  database  server  software  can 
be  designed  to  continue  to  operate  when  key  components  of  a  parallel  platform  fail. 

A  large  parallel  database  server  is  likely  to  have  many  disks,  and  so  disk  failure 
will  be  a  relatively  common  occurrence.  In  order  to  avoid  a  break  of  service  for 
recovery  when  a  disk  fails  it  is  necessary  to  duplicate  data  on  more  than  one  disk. 
Plexing  and  RAID  schemes  can  be  used  for  this.  In  Goldrush,  data  is  partitioned  over
a  set  of  disk  volumes,  each  of  which  is  duplexed.  The  two  plexes  are  always  chosen  to 
be  remote  from  each  other,  i.e.  held  on  disks  not  connected  to  the  same  node,  so  that 
even  if  a  node  fails  then  at  least  one  plex  is  still  accessible. 
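
The constraint that the two plexes of a volume are held on different nodes can be sketched as follows. The placement policy shown (neighbouring nodes in sorted order) is an assumption for illustration only; the text does not describe how Goldrush chooses plex locations.

```python
def place_plexes(volumes, disks_by_node):
    """Give every volume two plexes held on disks of two different nodes."""
    nodes = sorted(disks_by_node)
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to keep plexes remote")
    placement = {}
    for i, volume in enumerate(volumes):
        first = nodes[i % len(nodes)]
        second = nodes[(i + 1) % len(nodes)]  # always differs from first
        placement[volume] = tuple(
            (node, disks_by_node[node][i % len(disks_by_node[node])])
            for node in (first, second)
        )
    return placement


def accessible(placement, failed_node):
    """True if every volume still has a plex on a node other than the failed one."""
    return all(
        any(node != failed_node for node, _ in plexes)
        for plexes in placement.values()
    )


disks_by_node = {"n0": ["d0", "d1"], "n1": ["d0"], "n2": ["d0"]}
placement = place_plexes(["v0", "v1", "v2"], disks_by_node)
```

With this placement, the failure of any single node leaves at least one plex of every volume reachable.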

If  a  node  fails  then  the  database  server  running  on  that  node  will  be  lost,  but  this 
need  not  prevent  other  nodes  from  continuing  to  provide  a  service,  albeit  with
reduced  aggregate  performance.  Any  lock  management  data  held  in  memory  on  the
failed  node  becomes  inaccessible  and  so,  as  this  may  be  required  by  transactions 
running  on  other  nodes,  the  lock  manager  design  must  ensure  that  the  information  is 
still  available  from  another  node.  This  can  be  achieved  by  plexing  this  information 
across  the  memories  of  two  nodes. 

Failure  of  an  external  network  link  may  cause  the  parallel  server  to  lose  contact
with  an  external  client.  In  the  Goldrush  system,  the  parallel  server  holds  information 
on  alternative  routes  to  clients  and  monitors  the  external  network  connections  so  that  if 
there  is  a  failure  then  an  alternative  route  can  be  automatically  chosen. 


3  Parallel  ODBMS  Requirements 

In  this  section  we  describe  and  justify  the  requirements  which  we  believe  a  parallel 
ODBMS  server  should  meet.  A  major  source  of  requirements  is  derived  from  our 
experience  with  the  design  and  usage  of  the  ICL  Goldrush  MegaServer  [1,  5,  6]  which 
embodies  a  number  of  the  functions  and  attributes  that  will  also  be  required  in  a 
parallel  ODBMS.  However,  it  is  also  the  case  that  there  are  interesting  differences 
between  the  requirements  for  a  parallel  RDBMS,  such  as  Goldrush,  and  a  parallel 
ODBMS.  In  particular,  we  envisage  a  parallel  ODBMS  as  a  key  component  for 
building  high  performance  distributed  applications  as  it  has  attributes  not  found  in 
alternative  options  including:  performance,  availability,  rich  interfaces  for  accessing 
information  (including  querying),  and  transactional  capability  to  preserve  database 
integrity.  However,  if  a  parallel  ODBMS  is  to  fulfill  its  potential  in  this  area  then  we 
believe  that  there  are  a  set  of  requirements  which  it  must  meet,  and  these  are  discussed 
in  this  section. 

The  rest  of  this  section  is  structured  as  follows.  We  first  describe  the  overall  system 
requirements,  before  examining  the  non-functional  requirements  of  performance  and 
availability. 


3.1  Systems  Architecture 

Our  main  requirement  is  for  a  server  which  provides  high  performance  to  ODMG 
compatible  database  clients  by  exploiting  both  inter-query  and  intra-query  parallelism. 
This  requires  a  system  architecture  similar  to  that  described  for  RDBMS  in  Section  2. 
Client  applications  may  generate  OQL  queries  which  are  sent  for  execution  to  the 
parallel  server  (via  the  Communications  Nodes).  To  support  intra-query  parallelism 
the  server  must  be  capable  of  executing  these  queries  in  parallel  across  a  set  of  nodes. 
The  server  must  also  support  inter-query  parallelism  by  simultaneously  executing  a  set 
of  OQL  queries  sent  by  a  set  of  clients.  Clients  may  also  map  database  objects  into 
program  objects  in  order  to  (locally)  access  their  properties  and  call  methods  on  them. 
Therefore,  the  server  must  also  satisfy  requests  from  clients  for  copies  of  objects  in  the 
database.  The  roles  of  the  parallel  system,  serving  clients,  are  summarised  by  Figure  2. 

Fig.  2.  Clients  of  the  Parallel  ODBMS  Server

The  architecture  does  not  preclude  certain  clients  from  choosing  to  execute  OQL 
queries  locally,  rather  than  on  the  parallel  server.  This  would  require  the  client  to  have 
local  OQL  execution  capability,  but  the  client  would  still  need  to  access  the  objects 
required  during  query  execution  from  the  server.  Reasons  for  doing  this  might  be  to 
exploit  a  particularly  high  performance  client,  or  to  remove  load  from  the  server  so 
that  it  can  spend  more  of  its  computational  resources  on  other,  higher  priority  queries. 

Where  the  ODBMS  is  to  act  as  a  component  in  a  distributed  system,  we  require  that 
it  can  be  accessed  through  standard  CORBA  interfaces  [14].  The  ODMG  standard 
proposes  that  a  special  component  -  an  Object  Database  Adapter  (ODA)  -  is  used  to 
connect  the  ODBMS  to  a  CORBA  Object  Request  Broker  (ORB)  [3].  This  allows  the 
ODBMS  to  register  a  set  of  objects  as  being  available  for  external  access  via  the  ORB.
CORBA  clients  can  then  generate  requests  for  the  execution  of  methods  on  these 
objects.  The  parallel  ODBMS  services  these  requests,  returning  the  results  to  the 
clients.  In  order  to  support  high  throughput,  the  design  of  the  parallel  ODBMS  must 
ensure  that  the  stream  of  requests  through  the  ODA  is  executed  in  parallel. 


3.2  Performance 

Whilst  high  absolute  performance  is  a  key  requirement,  experience  with  the 
commercial  usage  of  parallel  relational  database  servers  has  shown  that  scalability  is  a 
key  attribute.  It  must  be  possible  to  add  nodes  to  increase  the  server's  performance  for 
both  inter-query  and  intra-query  parallelism.  In  Section  2,  we  described  how  the 
distributed  memory  parallel  architecture  allows  for  scalability  at  the  hardware
platform  level,  but  this  is  only  one  aspect  of  achieving  database  server  scalability. 
Adding  a  node  to  the  system  provides  more  processing  power  which  may  be  used  to 
speed-up  the  evaluation  of  queries.  However,  for  those  workloads  whose  performance
is  largely  dependent  on  the  performance  of  object  access,  adding  a  node  will  have
little  or  no  immediate  effect.  In  the  parallel  ODBMS,  sets  of  objects  will  be 
partitioned  across  a  set  of  disks  and  nodes  (c.f.  the  design  of  a  parallel  RDBMS  as 
described  in  Section  2),  and  so  increasing  the  aggregate  object  access  throughput 
requires  re-partitioning  the  sets  of  objects  across  a  larger  number  of  disks  and  nodes. 

Experience  with  serial  ODBMS  has  shown  that  the  ability  to  cluster  on  disk  objects 
likely  to  be  accessed  together  is  important  for  reducing  the  number  of  disk  accesses 
and  so  increasing  performance  [15].  Indeed,  the  ODMG  language  bindings  provide 
versions  of  object  constructors  which  allow  the  programmer  to  specify  that  a  new
object  should  be  clustered  with  an  existing  object  [3].  Over  time,  the  pattern  of  access 
to  objects  may  change,  or  be  better  understood  (due  to  information  provided  by 
performance  monitoring),  and  so  support  for  re-clustering  is  also  required  to  allow 
performance  tuning. 

As  will  be  seen  in  Section  4,  the  requirement  to  support  the  re-partitioning  and
re-clustering  of  objects  has  major  implications  for  the  design  of  a  parallel  ODBMS.


3.3  Performance  Management 

This  section  describes  requirements  which  relate  to  the  use  of  a  database  as  a  resource 
shared  among  a  set  of  services  with  different  performance  requirements. 

The  computational  resources  (CPU,  disk  I/Os,  etc.)  made  available  by  high
performance  systems  are  expensive  -  customers  pay  more  per  unit  resource  on  a  high 
performance  system  than  they  do  on  a  commodity  computer  system.  In  many  cases, 
resources  will  not  be  completely  exhausted  by  a  single  service  but  if  the  resources  are 
to  be  shared  then  mechanisms  are  needed  to  control  this  sharing. 

An  example  of  the  need  for  controlled  resource  sharing  is  where  a  database  runs  a 
business-critical  OLTP  workload  which  does  not  utilise  all  the  server’s  resources.  It 
may  be  desirable  to  run  other  services  against  the  same  database,  for  example  in  order 
to  perform  data  mining,  but  it  is  vital  that  the  performance  of  the  business-critical
OLTP  service  is  not  affected  by  the  loss  of  system  resources  utilised  by  other  services. 
In  a  distributed  memory  parallel  server,  the  two  services  may  be  kept  apart  on  non- 
intersecting  sets  of  nodes  in  order  to  avoid  conflicts  in  the  sharing  of  CPU  resources. 
However,  they  will  still  have  to  share  the  disks  on  which  the  database  is  held.  A 
solution  might  be  to  analyse  the  disk  access  requirements  of  the  two  services  and  try  to 
ensure  that  the  data  is  partitioned  over  a  sufficiently  large  set  of  nodes  and  disks  so  as 
to  be  able  to  meet  the  combined  performance  requirements  of  both  services.  However 
this  is  only  possible  where  the  performance  requirements  of  the  two  services  are 
relatively  static  and  therefore  predictable.  This  may  be  true  of  the  OLTP  service,  but 
the  performance  of  a  complex  query  or  data  mining  service  may  not  be  predictable.  A
better  solution  would  be  to  ensure  that  the  OLTP  service  was  allocated  the  share  of  the
resources  that  it  required  to  meet  its  performance  requirements,  and  only  allow  the
other  service  the  remainder. 
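
The allocation policy suggested above can be sketched for a single resource and a single scheduling interval. The units (for example, disk I/Os per second), the strict priority ordering and the names used are assumptions made for this illustration.

```python
def allocate_by_priority(capacity, demands):
    """Grant each service its demand in priority order; later services share what is left.

    demands is a list of (service, demand) pairs, highest priority first.
    """
    grants = {}
    remaining = capacity
    for service, demand in demands:
        grant = min(demand, remaining)
        grants[service] = grant
        remaining -= grant
    return grants


# The OLTP service is served first; data mining receives only the remainder.
grants = allocate_by_priority(1000, [("oltp", 600), ("mining", 900)])
```

Here the OLTP service receives its full 600 units regardless of how much the data mining service requests, and the data mining service is limited to the remaining 400.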

A  further  example  of  the  importance  of  controlled  resource  sharing  is  where 
parallel  ODBMS  act  as  repositories  for  persistent  information  on  the  Internet  and 
intranets.  If  information  is  made  available  in  this  way  then  it  is  important  that  the 
limited  resources  of  the  ODBMS  are  shared  in  a  controlled  way  among  clients.  For 
example  there  is  a  danger  of  denial  of  service  attacks  in  which  all  the  resources  of 
the  system  are  used  by  a  malicious  client.  Similarly,  it  would  be  easy  for  a  non-malicious
user  to  generate  a  query  which  required  large  computational  resources,  and  so  reduced 
the  performance  available  to  other  clients  below  an  acceptable  level.  Finally,  it  may  be 
desirable  for  the  database  provider  to  offer  different  levels  of  performance  to  different 
classes  of  clients  depending  on  their  importance  or  payment. 

The  need  for  controlled  sharing  of  resources  has  long  been  recognised  in  the 
mainstream  computing  world.  For  example,  mainframe  computer  systems  have  for 
several  decades  provided  mechanisms  to  control  the  sharing  of  their  computational 
resources  among  a  set  of  users  or  services.  Some  systems  allow  CPU  time  and  disk 
I/Os  to  be  divided  among  a  set  of  services  in  any  desired  ratio.  Priority  mechanisms  for
allocating  resources  to  users  and  services  may  also  be  offered.  Such  mechanisms  are 
now  also  provided  for  some  shared  memory  parallel  systems. 

Unfortunately,  the  finest  granularity  at  which  these  systems  control  the  sharing  of 
resources  tends  to  be  at  an  Operating  System  process  level.  This  is  too  crude  for 
database  servers,  which  usually  have  at  their  heart  a  single,  multi-threaded  process 
which  executes  the  queries  generated  by  a  set  of  clients  (so  reducing  switching  costs 
when  compared  to  an  implementation  in  which  each  client  has  its  own  process  running 
on  the  server).  Therefore,  in  conclusion,  we  believe  that  the  database  server  must 
provide  its  own  mechanisms  to  share  resources  among  a  set  of  services  in  a  controlled 
manner  if  expensive  parallel  system  resources  are  to  be  fully  utilised,  and  if  object 
database  servers  are  to  fulfil  their  potential  as  key  components  in  distributed  systems. 


3.4  Availability 

The  requirements  here  are  identical  to  those  of  a  parallel  RDBMS.  The  parallel 
ODBMS  must  be  able  to  continue  to  operate  in  the  presence  of  node,  disk  and
network  failures.  It  is  also  important  that  availability  is  considered  when  designing
database  management  operations  which  may  affect  the  availability  of  the  database
service.  These  include:  archiving,  upgrading  the  server  software  and  hardware,  and
re-partitioning  and  re-clustering  data  to  increase  performance.  Ideally  these  should  all  be
achievable  on-line,  without  the  need  to  shut-down  the  database  service,  but  if  this  is 
not  possible  then  the  time  for  which  the  database  service  is  down  should  be 
minimised. 

A  common  problem  causing  loss  of  availability  during  management  operations  is 
human  error.  Manually  managing  a  system  with  up  to  tens  of  nodes,  and  hundreds  of
disks  is  very  error  prone.  Therefore  tools  to  automate  management  tasks  are  important 
for  reducing  errors  and  increasing  availability. 


3.5  Hardware  Platform

The  Polar  project  has  access  to  a  parallel  database  platform,  an  ICL  Goldrush 
MegaServer  [1];  however,  we  are  also  investigating  alternatives.  Current  commercial
parallel  database  platforms  use  custom  designed  components  (including  internal 
networks,  processor-network  interfaces  and  cabinetry),  which  leads  to  high  design, 
development  and  manufacturing  costs,  which  are  then  passed  on  to  users.  Such 
systems  generally  have  a  cost-performance  ratio  that  is  significantly  higher  than  that 
of  commodity,  uniprocessor  systems.  We  are  investigating  a  solution  to  this  problem 
and  have  built  a  parallel  machine,  the  Affordable  Parallel  Platform  (APP),  entirely 
from  standard,  low-cost,  commodity  components  -  high-performance  PCs
interconnected  by  a  high  throughput,  scaleable  ATM  network.  Our  current  system  consists
of  13  nodes  interconnected  by  155Mbps  ATM.  The  network  is  scaleable  as  each  PC
has  a  full  155Mbps  connection  to  the  other  PCs.  This  architecture  has  the  interesting
property  that  high-performance  clients  can  be  connected  directly  to  the  ATM  switch 
giving  them  a  very  high  bandwidth  connection  to  the  parallel  platform.  This  may,  for 
example,  be  advantageous  if  large  multimedia  objects  must  be  shipped  to  clients. 

Our  overall  aim  in  this  area  is  to  determine  whether  commodity  systems  such  as  the 
APP  can  compete  with  custom  parallel  systems  in  terms  of  database  performance. 
Therefore  we  require  that  the  database  server  design  and  implementation  is  portable 
across,  and  tuneable  for,  both  types  of  parallel  platform. 


4  Parallel  ODBMS  Design 

In  this  section  we  consider  issues  in  the  design  of  a  parallel  ODBMS  to  meet  the 
requirements  discussed  in  the  last  section.  We  cover  object  access  in  the  most  detail  as 
it  is  a  key  area  in  which  existing  parallel  RDBMS  and  serial  ODBMS  designs  do  not
offer  a  solution.  Cache  coherency,  performance  management  and  query  processing  are 
also  discussed. 


4.1  Object  Access 

Objects  in  an  object  database  can  be  individually  accessed  through  their  unique 
identifier.  This  is  a  property  not  found  in  relational  databases  and  so  we  first  discuss 
the  issues  in  achieving  this  on  the  parallel  server  (we  leave  discussion  of  accessing
collections  of  objects  until  Section  4.3).

As  described  earlier,  the  sets  of  objects  stored  in  the  parallel  server  are  partitioned 
across  a  set  of  disks  and  nodes  in  order  to  provide  a  high  aggregate  access  throughput 
to  the  set  of  objects.  When  an  object  is  created,  a  decision  must  be  made  as  to  the  node 
on  which  it  will  be  stored.  When  a  client  needs  to  access  a  persistent  object  then  it 
must  send  the  request  to  the  node  holding  the  object. 

In  the  rest  of  this  section  we  discuss  in  more  detail  the  various  issues  in  the  storage
and  accessing  of  objects  in  a  parallel  ODBMS  as  these  differ  significantly  from  the 
design  of  a  serial  ODBMS  or  a  parallel  RDBMS.  Firstly  we  discuss  how  the  Task 
Shipping  and  Data  Shipping  mechanisms  described  in  Section  2  apply  to  a  parallel 
ODBMS.  We  then  discuss  possible  options  for  locating  and  accessing  persistent 
objects. 

4.1.1  Task  Shipping  vs.  Data  Shipping 

In  a  serial  ODBMS,  the  database  server  stores  the  objects  and  provides  a  run-time 
system  which  allows  applications  running  on  external  clients  to  transparently  access 
objects.  Usually  a  page  server  interface  is  offered  to  clients,  i.e.  a  client  requests  an 
object  from  the  server  which  responds  with  a  page  of  objects  [7].  The  page  is  cached
in  the  client  because  it  is  likely  that  the  requested  object  and  others  in  the  same  page 
will  be  accessed  in  the  near  future.  This  is  a  Data  Shipping  architecture,  and  has  the 
implication  that  the  same  object  may  be  cached  in  more  than  one  client,  so 
necessitating  a  cache  coherency  mechanism.  If  Data  Shipping  was  also  adopted  within 
the  parallel  server  then  nodes  executing  OQL  would  also  access  and  cache  objects 
from  remote  nodes. 
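
The page server interaction described above can be sketched as follows. This is a minimal illustration only: the class names `PageServer` and `CachingClient`, the integer object identifiers, and the page size of four objects are assumptions made for the sketch and are not taken from any particular ODBMS.

```python
class PageServer:
    """Toy page server: objects are grouped into fixed-size pages."""

    def __init__(self, objects, page_size=4):
        self.page_size = page_size      # objects per page (assumed)
        self.objects = dict(objects)    # object id -> value
        self.requests = 0               # number of page requests served

    def page_of(self, oid):
        return oid // self.page_size

    def fetch_page(self, page_no):
        """Return every object stored on the given page."""
        self.requests += 1
        return {oid: value for oid, value in self.objects.items()
                if self.page_of(oid) == page_no}


class CachingClient:
    """Client run-time system that caches whole pages (Data Shipping)."""

    def __init__(self, server):
        self.server = server
        self.cache = {}                 # page number -> {oid: value}

    def get(self, oid):
        page_no = self.server.page_of(oid)
        if page_no not in self.cache:   # cache miss: the page is shipped
            self.cache[page_no] = self.server.fetch_page(page_no)
        return self.cache[page_no][oid]


server = PageServer({i: f"obj{i}" for i in range(8)})
client = CachingClient(server)
values = [client.get(i) for i in (0, 1, 2, 5)]
```

Four object accesses cause only two page requests here, because objects 0, 1 and 2 share a page. The sketch deliberately omits cache coherency, which becomes necessary as soon as a second client caches the same page.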

We  now  consider  if  the  alternative  used  in  some  parallel  RDBMS  -  a  Task  Shipping 
architecture  -  is  a  viable  option  for  a  parallel  ODBMS.  There  are  two  computations 
which  can  be  carried  out  on  objects:  executing  a  method  on  an  object  and  accessing  a 
property  of  an  object.  These  computations  can  occur  either  in  an  application  running 
on  an  external  client,  or  during  OQL  execution  on  a  node  of  the  parallel  server. 

In order to implement task shipping, the external clients and the nodes executing OQL
would not cache remote objects and process them locally. Instead, they would have to
send  a  message  to  the  node  which  stored  the  object  requesting  that  it  perform  the 
computation  (method  execution  or  property  access)  locally  and  return  the  result.  In 
this  way,  the  client  does  not  cache  objects,  and  so  cache  coherency  is  not  an  issue. 
However, Task Shipping in object database servers does have two major drawbacks.
Firstly,  the  load  on  the  server  is  increased  when  compared  to  the  data  shipping 
scheme.  This  means  that  a  server  is  likely  to  be  able  to  support  fewer  clients. 
Secondly,  the  latency  of  task  shipping  -  the  time  between  sending  the  task  to  the  server 
and  receiving  the  response  -  is  likely  to  result  in  performance  degradation  at  the  client 
when  compared  to  the  data  shipping  scheme  in  which  the  client  can  operate  directly  on 
the  object  without  any  additional  latency  once  it  has  been  cached.  Many  database 
clients  are  likely  to  be  single  application  systems,  and  so  there  will  not  be  other 
processes  to  run  while  a  computation  on  an  object  is  being  carried  out  on  the  parallel 
server  -  consequently  the  CPU  will  be  idle  during  this  time.  Further,  the  user  of  an 
interactive  client  application  will  be  affected  by  the  increase  in  response  time  caused 
by  the  latency. 

The  situation  is  slightly  different  for  the  case  of  OQL  query  execution  on  the  nodes 
of  the  server  under  a  Task  Shipping  scheme.  When  query  execution  is  suspended 
while  a  computation  on  an  object  is  executed  remotely,  it  is  very  likely  that  there  will 
be  other  work  to  do  on  the  node  (for  example,  servicing  requests  for  operations  on 
objects  and  executing  other  queries)  and  so  the  CPU  will  not  be  idle.  However,  it  is 
likely that the granularity of parallel processing (the ratio of computation to
communication)  will  be  low  due  to  the  frequent  need  to  generate  requests  for 
computations  on  remote  objects.  Therefore,  although  the  node  may  not  be  idle,  it  will 
be  executing  frequent  task  switches  and  sending/receiving  requests  for  computations 
on  objects.  The  effect  of  this  is  likely  to  be  a  reduction  in  throughput  when  compared 
to  a  data  shipping  scheme  with  high  cache  hit  rates  which  will  allow  a  greater 
granularity  of  parallelism. 
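
The difference in communication frequency between the two schemes can be made concrete with a rough count of request/response round trips. The sketch below is a simplified cost model under stated assumptions: the function names are hypothetical, the cache never evicts a page, and cache coherency traffic and message sizes are ignored.

```python
def task_shipping_round_trips(accesses):
    """Every operation on an object is a request/response to the owning node."""
    return len(accesses)


def data_shipping_round_trips(accesses, page_of):
    """Only the first access to each page goes to the server (no eviction)."""
    cached_pages = set()
    round_trips = 0
    for oid in accesses:
        page = page_of(oid)
        if page not in cached_pages:
            cached_pages.add(page)
            round_trips += 1
    return round_trips


# Six operations touching objects that live on two pages of four objects each.
accesses = [0, 1, 2, 0, 5, 1]
page_of = lambda oid: oid // 4

task_trips = task_shipping_round_trips(accesses)
data_trips = data_shipping_round_trips(accesses, page_of)
```

With a high cache hit rate the data shipping count stays close to the number of distinct pages touched, whereas the task shipping count grows with every operation.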

For  these  reasons,  we  do  not  further  consider  a  task  shipping  scheme,  and  so  must 
address  the  cache  coherence  issue  (Section  4.2).  Figure  3  shows  the  resulting  parallel 
database  server  architecture.  The  objects  in  the  database  are  partitioned  across  disks 
connected  to  the  nodes  of  the  system,  taking  into  account  clustering.  Requests  to 
access objects will come both from applications running on external database clients
and from the nodes of the parallel server which are executing queries or processing requests
for  method  execution  from  CORBA  clients.  The  database  client  applications  are 
supported  by  a  run-time  system  (RTS)  which  maintains  an  object  cache.  If  the 
application  accesses  an  object  which  is  not  cached  then  an  object  request  is  generated 
and  sent  to  a  Communication  Node  (CN)  of  the  parallel  server.  The  CN  uses  a  Global 
Object  Access  module  to  determine  the  node  which  has  the  object  stored  on  one  of  its 
local  disks.  The  Remote  Object  Access  component  forwards  the  request  to  this  node 
where  it  is  serviced  by  the  Local  Object  Access  component  which  returns  the  page 
containing  the  object  back  to  the  client  application  RTS  via  the  CN. 

If  the  client  application  issues  an  OQL  query  then  the  client’s  database  RTS  sends 
it  to  a  CN  of  the  parallel  server.  From  there  it  is  forwarded  to  a  node  for  compilation. 
This  generates  an  execution  plan  which  is  sent  to  the  Query  Execution  components  on 
one  or  more  nodes  for  execution.  The  Query  Execution  components  on  each  node  are 
supported  by  a  local  run  time  system,  identical  to  that  found  on  the  external  database 
clients,  which  accesses  and  caches  objects  as  they  are  required,  making  use  of  the 
Global  Object  Access  component  to  locate  the  objects.  The  question  of  how  objects 
are  located  is  key,  and  is  discussed  in  the  next  section. 

If  there  are  CORBA  based  clients  generating  requests  for  the  execution  of  object 
methods then the requests will be directed (by the ORB) to a CN. In order to maximise
the proportion of local object accesses (which are more efficient than remote object
accesses), the CN will route the request to the node which stores the object for
execution.  However,  there  will  be  circumstances  in  which  this  will  lead  to  an 
imbalance  in  the  loading  on  the  nodes,  for  example  if  one  object  frequently  occurs  in 
the  requests.  In  these  circumstances,  the  CN  may  choose  to  direct  the  request  to 
another,  more  lightly  loaded,  node. 
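
One possible form of this routing decision is sketched below. The load metric (a queue length per node), the `slack` parameter and the function name `choose_node` are all assumptions introduced for illustration; the text does not prescribe a particular policy.

```python
def choose_node(owner, load, slack=5):
    """Pick the node that should execute a method request.

    owner: id of the node that stores the target object
    load:  dict mapping node id -> current queue length (assumed metric)
    slack: how much busier than the lightest node the owner may be
           before the request is diverted (assumed tuning parameter)
    """
    lightest = min(load, key=load.get)
    if load[owner] - load[lightest] > slack:
        return lightest   # owner is overloaded: accept a remote object access
    return owner          # default: execute where the object is stored


balanced = choose_node("n1", {"n1": 3, "n2": 1})
overloaded = choose_node("n1", {"n1": 12, "n2": 1})
```

In the first call the owning node is kept because the imbalance is small; in the second the request is diverted to the more lightly loaded node, which then has to fetch the object remotely.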

Fig. 3. Database Server System Architecture


4.1.2  Persistent  Object  Access 

In  this  section  we  describe  the  major  issues  in  locating  objects.  These  include  the 
structure  of  object  identifiers  (OIDs)  and  the  mechanism  by  which  an  OID  is  mapped 
to  a  persistent  object  identifier  (PID).  A  PID  is  the  physical  address  of  the  object,  i.e. 
it  includes  information  on  the  node,  disk  volume,  page  and  offset  at  which  the  object 
can  be  accessed. 

Based  on  the  requirements  given  in  Section  3,  the  criteria  which  an  efficient  object 
access  scheme  for  a  parallel  server  must  meet  are: 

•  a  low  OID  to  PID  mapping  cost.  This  could  be  a  significant  contribution  to  the  total 
cost  of  an  object  access  and  is  particularly  important  as  the  Communications  Nodes 
(which  are  only  a  subset  of  the  nodes  in  the  system)  must  perform  this  mapping  for 
all  the  object  requests  from  external  ODBMS  and  CORBA  clients.  We  want  to 
minimise the number of CNs as they will be more expensive than other nodes,
and because the maximum number of CNs in a parallel platform may be limited for
physical design reasons. The OID to PID mapping will also have to be performed
on  the  other  nodes  when  the  Query  Execution  components  access  objects. 

•  minimising  the  number  of  nodes  involved  in  an  object  access  as  each  inter-node 
message  contributes  to  the  cost  of  the  access. 

•  ability  to  re-cluster  objects  if  access  patterns  change. 

•  ability  to  re-partition  sets  of  objects  over  a  different  number  of  nodes  and  disks  to 
meet  changing  performance  requirements. 

•  a  low  object  allocation  cost. 

We now consider a set of possible options for object access and consider how they
match up against these criteria.

4.1.2.1 Physical OID schemes

In this option, the PID is directly encoded in the OID. For example, the OID could
have the structure:

Node | Disk Volume | Page | Offset

The  major  advantage  of  this  scheme  is  that  the  cost  of  mapping  the  OID  to  the  PID 
is  zero.  The  Communications  Nodes  (CNs)  only  have  to  examine  the  OIDs  within 
incoming  object  requests  from  external  clients  and  forward  the  request  to  the  node 
contained  in  the  OID.  This  is  just  a  low-level  routing  function  and  could  probably  be 
most  efficiently  performed  at  a  low  level  in  the  communications  stack.  Therefore,  the 
load  on  a  CN  incurred  by  each  external  object  access  will  be  low.  Similarly,  object 
accesses  generated  on  a  node  by  the  Query  Execution  component  can  be  directed 
straight  to  the  node  which  owns  the  object.  Therefore  this  scheme  does  minimise  the 
number  of  nodes  involved  in  an  access.  However,  because  the  OID  contains  the  exact 
physical  location  of  the  object,  it  is  not  possible  to  move  an  object  to  a  different  page, 
disk  volume  or  node.  This  rules  out  re-clustering  or  re-partitioning. 
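
As an illustration of why the mapping cost is zero, the sketch below packs the four fields into a single integer and shows that routing only requires reading the Node field. The field widths are assumptions chosen for the example, and the names `PhysicalOID`, `pack`, `unpack` and `route` are hypothetical.

```python
from typing import NamedTuple

# Field widths in bits: assumptions for this sketch only.
NODE_BITS, VOLUME_BITS, PAGE_BITS, OFFSET_BITS = 8, 8, 32, 16


class PhysicalOID(NamedTuple):
    node: int
    volume: int
    page: int
    offset: int


def pack(oid: PhysicalOID) -> int:
    """Concatenate the fields, most significant first (no range checks)."""
    value = oid.node
    value = (value << VOLUME_BITS) | oid.volume
    value = (value << PAGE_BITS) | oid.page
    value = (value << OFFSET_BITS) | oid.offset
    return value


def unpack(value: int) -> PhysicalOID:
    """Recover the PID fields directly from the OID bits."""
    offset = value & ((1 << OFFSET_BITS) - 1)
    value >>= OFFSET_BITS
    page = value & ((1 << PAGE_BITS) - 1)
    value >>= PAGE_BITS
    volume = value & ((1 << VOLUME_BITS) - 1)
    node = value >> VOLUME_BITS
    return PhysicalOID(node, volume, page, offset)


def route(value: int) -> int:
    """All a CN needs to do: read the Node field and forward."""
    return value >> (VOLUME_BITS + PAGE_BITS + OFFSET_BITS)


oid = PhysicalOID(node=3, volume=1, page=1000, offset=42)
packed = pack(oid)
```

Because the location is part of the identifier itself, moving the object would change `packed`, which is exactly why this scheme rules out re-clustering and re-partitioning.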

When  an  object  is  to  be  created  then  the  first  step  is  to  select  a  node.  This  could  be 
done  by  the  CN  using  a  round-robin  selection  scheme  to  evenly  spread  out  the  set  of 
objects.  The  object  is  then  created  on  the  chosen  node.  Therefore  object  creation  is  a 
low  cost  operation. 

Therefore,  this  scheme  will  perform  well  in  a  stable  system  but  as  it  does  not  allow 
re-partitioning or re-clustering, it is likely that over time the system will become
de-tuned as the server’s workload changes.

We  now  consider  three  possible  variations  of  the  scheme  which  attempt  to 
overcome  this  problem: 

Indirections. The OID scheme used by O2 [7] includes a physical volume number
in the OID, and so runs into similar problems if an object is to be moved to a different
volume. The O2 solution is to place an indirection at the original location of the object
pointing  to  the  new  location  of  the  object.  Therefore,  potentially  two  disk  accesses 
might  be  incurred  and  (in  a  parallel  server)  two  nodes  might  be  involved  in  an  object 
access.  Also,  this  is  not  a  viable  solution  for  re-partitioning  a  set  of  objects  across  a 
larger  number  of  disks  and  nodes  in  order  to  increase  the  aggregate  throughput  to 
those  objects.  The  initial  accesses  to  all  the  objects  will  still  have  to  go  to  the 
indirections on the original disks/nodes and so they will still be a bottleneck. For this
reason  we  disregard  this  variant  as  a  solution. 
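
The extra cost of an indirection can be seen in a small sketch. The `Store` class and its (node, page) location tuples are assumptions made for illustration and do not model the actual O2 implementation.

```python
class Store:
    """Toy store keyed by physical location.

    Each slot holds either ("obj", value) or ("fwd", new_location).
    """

    def __init__(self):
        self.slots = {}

    def put(self, location, value):
        self.slots[location] = ("obj", value)

    def move(self, old, new):
        """Move an object and leave an indirection at its old location."""
        kind, value = self.slots[old]
        if kind != "obj":
            raise ValueError("no object stored at the old location")
        self.slots[new] = ("obj", value)
        self.slots[old] = ("fwd", new)

    def read(self, location):
        """Return (value, number of locations visited)."""
        visited = 1
        kind, payload = self.slots[location]
        while kind == "fwd":            # follow the indirection chain
            location = payload
            kind, payload = self.slots[location]
            visited += 1
        return payload, visited


store = Store()
store.put(("n1", 7), "x")
direct = store.read(("n1", 7))
store.move(("n1", 7), ("n2", 3))
indirect = store.read(("n1", 7))
```

After the move, a read through the original OID still has to visit the old location on node n1 before reaching n2, so the original node remains on the access path however many nodes the objects are spread over.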

Only include the Physical Node in the OID. In this variant, the OID structure
would contain the physical node address, and then a logical address within the node,
i.e.:


Node | Logical address within node


The  logical  address  would  be  mapped  to  the  physical  address  at  the  node;  methods 
for this include hash tables and B-trees (the Direct Mapping scheme [16] is, however,
not  an  option  as  it  is  not  scaleable:  once  allocated,  the  handles  which  act  as 
indirections  to  the  objects  cannot  be  moved).  This  would  allow  re-clustering  within  a 
node  by  updating  the  mapping  function,  and  re-clustering  across  nodes  using 
indirections,  but  it  would  not  support  re-partitioning  across  more  nodes  as  the  Node 
address  is  still  fixed  by  the  OID.  Therefore  we  disregard  this  variant  as  a  solution. 

Update  OIDs  when  an  Object  is  Moved.  If  this  could  be  implemented  efficiently 
then  it  would  have  a  number  of  advantages.  Objects  could  be  freely  moved  within  and 
between  nodes  for  re-clustering  and  re-partitioning,  but  because  their  OID  would  be 
updated  to  contain  their  new  physical  address  then  the  mapping  function  would 
continue  to  have  zero  cost,  and  no  indirections  would  be  required.  However,  such  a 
scheme  would  place  restrictions  on  the  external  use  of  OIDs.  Imagine  a  scenario  in 
which  a  client  application  acquires  an  OID  and  stores  it.  The  parallel  ODBMS  then 
undergoes  a  re-organisation  which  changes  the  location  of  some  objects,  including  the 
one  whose  OID  was  stored  by  the  client  application.  That  OID  is  now  incorrect  and 
points  to  the  wrong  physical  object  location.  A  solution  to  this  problem  is  to  limit  the 
temporal  scope  of  OIDs  to  a  client’s  database  session.  The  ODMG  standard  allows 
objects  to  be  named,  and  provides  a  mapping  function  to  derive  the  object’s  OID  from 
its  name.  It  could  be  made  mandatory  for  external  database  clients  to  only  refer  to 
objects  by  name  outside  of  a  database  session.  Within  the  session  the  client  would  use 
the  mapping  function  to  acquire  the  OID  of  an  object.  During  the  session,  this  OID 
could  be  used,  but  when  the  session  ended  then  the  OID  would  cease  to  have  meaning 
and at the beginning of subsequent sessions OIDs would have to be re-acquired using the
name  mapping  function.  This  protocol  leaves  the  database  system  free  to  change  OIDs 
while  there  are  no  live  database  sessions  running.  When  an  object  is  moved  then  the 
name  mapping  tables  would  be  updated.  Any  references  to  the  object  from  other 
objects  in  the  database  would  also  have  to  be  updated.  If  each  reference  in  a  database 
is  bi-directional  then  this  process  is  simplified,  as  all  references  to  an  object  could  be 
directly  accessed  and  updated.  The  ODMG  standard  does  permit  unidirectional 
references  but  these  can  still  be  implemented  bi-directionally.  Another  option  is  to 
scan the whole database to locate any references to the object to be moved. In a large
database this could be prohibitively time-consuming. An alternative is the Thor
indirection  based  scheme  [10],  in  which,  over  a  period  of  time,  references  are  updated 
to  remove  indirections. 
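
The session-scoped use of OIDs described above can be sketched with a simple name directory. The class `NameDirectory`, the object name and the tuple form of the OIDs are hypothetical and serve only to illustrate the protocol; the ODMG name mapping interface itself is not reproduced here.

```python
class NameDirectory:
    """Maps permanent object names to the OID that is currently valid."""

    def __init__(self):
        self.oid_by_name = {}

    def bind(self, name, oid):
        self.oid_by_name[name] = oid

    def lookup(self, name):
        return self.oid_by_name[name]

    def relocate(self, old_oid, new_oid):
        """Called when an object moves: rebind every name that pointed at it."""
        for name, oid in list(self.oid_by_name.items()):
            if oid == old_oid:
                self.oid_by_name[name] = new_oid


directory = NameDirectory()
directory.bind("parts-root", ("n1", 0, 7, 16))

# Session 1: the client acquires the OID by name and uses it until the session ends.
session1_oid = directory.lookup("parts-root")

# Between sessions the database is re-organised and the object moves.
directory.relocate(("n1", 0, 7, 16), ("n2", 1, 3, 0))

# Session 2: the stale OID is discarded and a fresh one is acquired by name.
session2_oid = directory.lookup("parts-root")
```

The client only ever stores the name across sessions, so the OID it obtains in the second session already reflects the new physical location.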


A  name  based  external  access  scheme  could  be  used  for  CORBA  client  access  to 
objects.  Objects  could  be  registered  with  an  ORB  by  name.  Therefore,  incoming 
requests  from  CORBA  based  clients  would  require  the  object  name  to  be  mapped  to 
an OID at a CN. A CORBA client could also navigate around a database so long as it
started  from  a  named  object,  and  provided  that  the  OIDs  of  any  other  objects  that  it 
reached  were  registered  with  the  ORB.  However,  if  the  client  wished  to  store  a 
reference  to  an  object  for  later  access  (outside  the  current  session)  then  it  would  have 
to  ask  the  ODBMS  to  create  a  name  for  the  object.  The  name  would  be  stored  in  the 
mapping  tables  on  the  CNs,  ready  for  subsequent  accesses.  This  procedure  can,  for 
example,  be  used  to  create  interlinked,  distributed  databases  in  which  references  in 
one  database  can  refer  to  objects  in  another. 

4.1.2.2 Logical OID schemes

At  the  other  extreme  from  the  last  scheme  is  one  in  which  the  OID  is  completely 
logical  and  so  contains  no  information  about  the  physical  location  of  the  object  on 
disk.  Consequently,  each  object  access  requires  a  full  logical  to  physical  mapping  to 
determine  the  PID  of  the  object.  Methods  of  performing  this  type  of  mapping  have 
been extensively studied with respect to a serial ODBMS [16]; however, their
appropriateness  for  a  parallel  system  requires  further  investigation. 

As  stated  above,  the  logical  to  physical  mapping  itself  can  be  achieved  by  a  number 
of  methods  including  hash  tables  and  B-trees.  Whatever  method  is  used  for  the 
mapping,  the  Communication  Nodes  will  have  to  perform  the  mapping  for  requests 
received from external clients. This may require disk accesses as the mapping table for a
large  database  will  not  fit  into  main  memory.  Therefore  there  will  be  a  significant  load 
on  the  CNs,  and  it  will  be  necessary  to  ensure  that  there  are  enough  CNs  configured  in 
a  system  so  that  they  are  not  bottlenecks.  Similarly,  the  nodes  executing  OQL  will  also 
have  to  perform  the  expensive  OID  to  PID  mapping.  However,  once  the  mapping  has 
been  carried  out  then  the  request  can  be  routed  directly  to  the  node  on  which  the  object 
is  located. 

As  all  nodes  will  need  to  perform  the  OID  to  PID  mapping,  then  each  will  need  a 
complete  copy  of  the  mapping  information.  We  rule  out  the  alternative  solution  of 
having  only  a  subset  of  the  nodes  able  to  perform  this  mapping  due  to  the  extra 
message  passing  latency  that  this  would  add  to  each  object  access.  This  has  two 
implications: each node needs to allocate storage space to hold the mapping; and
whenever  a  new  object  is  created  then  the  mapping  information  in  all  nodes  needs  to 
be  updated,  which  results  in  a  greater  response  time,  and  worse  throughput  for  object 
creation  than  was  the  case  for  the  Physical  OID  scheme  described  in  the  last  section. 
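
The cost of keeping a complete copy of the mapping on every node can be illustrated as follows. The class `LogicalOIDCluster`, the counter-based OID allocation and the in-memory dictionaries are assumptions for this sketch; a real mapping would use hash tables or B-trees that may reside partly on disk.

```python
class LogicalOIDCluster:
    """Every node holds a full copy of the OID -> PID mapping."""

    def __init__(self, node_ids):
        self.maps = {node: {} for node in node_ids}   # per-node copy
        self.next_oid = 0
        self.map_updates = 0      # counts per-node mapping updates

    def _update_all(self, oid, pid):
        for mapping in self.maps.values():
            mapping[oid] = pid
            self.map_updates += 1

    def create(self, pid):
        """Allocate a purely logical OID and publish it to every node."""
        oid = self.next_oid
        self.next_oid += 1
        self._update_all(oid, pid)
        return oid

    def move(self, oid, new_pid):
        """Re-cluster or re-partition: the OID itself never changes."""
        self._update_all(oid, new_pid)

    def resolve(self, at_node, oid):
        """Any node can map an OID to a PID without asking another node."""
        return self.maps[at_node][oid]


cluster = LogicalOIDCluster(["n0", "n1", "n2"])
oid = cluster.create(("n2", 0, 5, 16))
updates_after_create = cluster.map_updates
cluster.move(oid, ("n0", 1, 9, 0))
```

A single object creation touches the mapping on all three nodes, which is the source of the higher creation cost, while the move leaves the OID held by clients unchanged.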

With  this  scheme,  unlike  the  Physical  OID  scheme,  names  are  not  mandatory  for 
storing  external  references  to  objects.  As  OIDs  are  unchanging,  they  can  be  used  both 
internally  as  well  as  externally.  This  removes  the  need  for  CNs  to  map  names  to  OIDs 
for  incoming  requests  from  CORBA  clients. 

4.1.2.3 Virtual Address Schemes

There  have  been  proposals  to  exploit  the  large  virtual  address  spaces  available  in  some 
modern processors, in object database servers, both parallel [17] and serial [18].
Persistent objects are allocated into the virtual address space and their virtual address
is used as their OID. This has the benefit of simplifying the task of locating a cached
object in main memory. However, an OID to PID mapping mechanism is still required
when a page fault occurs and the page containing an object must be retrieved from
disk. The structure of a typical virtual address is:

Segment | Page | Offset

In  a  parallel  ODBMS,  when  a  page  fault  occurs  then  the  segment  and  page  part  of 
the  address  is  mapped  to  a  physical  page  address.  The  page  is  fetched,  installed  in  real 
memory  and  the  CPU’s  memory  management  unit  tables  are  configured  so  that  any 
accesses  to  OIDs  in  that  page  are  directed  to  the  correct  real  memory  address. 

The  key  characteristic  of  this  scheme  is  that  the  OID  to  PID  mapping  is  done  at  the 
page  level,  i.e.  in  the  [Segment,  Page,  Offset]  structure  shown  above  only  the  Segment 
and  Page  fields  are  used  in  determining  the  physical  location  of  the  object.  Therefore 
it  is  possible  to  re-partition  sets  of  objects  by  changing  the  mapping,  but  it  is  not 
possible  to  re-cluster  objects  within  pages  unless  changing  OIDs,  or  indirections  are 
supported (as described in Section 4.1.2.1).

The  OID  to  PID  mapping  can  be  done  using  the  logical  or  physical  schemes 
described  earlier,  with  all  their  associated  advantages  and  disadvantages.  However,  it 
is  worth  noting  that  if  a  logical  mapping  is  used  then  the  amount  of  information  stored 
will be smaller than that required for the logical OID scheme of Section 4.1.2.2 as
only  pages,  rather  than  individual  objects  must  be  mapped. 
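
The page-level nature of the mapping can be sketched as below. The field widths, the function names `split` and `locate`, and the dictionary used as a page table are assumptions for illustration only.

```python
# Field widths in bits: assumptions for this sketch only.
PAGE_BITS, OFFSET_BITS = 20, 12


def split(vaddr):
    """Decompose a virtual address into (segment, page, offset)."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = vaddr >> (OFFSET_BITS + PAGE_BITS)
    return segment, page, offset


def locate(vaddr, page_table):
    """Map an OID to (node, disk page, offset) using only Segment and Page."""
    segment, page, offset = split(vaddr)
    node, disk_page = page_table[(segment, page)]
    return node, disk_page, offset


vaddr = (2 << 32) | (5 << 12) | 100          # segment 2, page 5, offset 100
page_table = {(2, 5): ("n1", 900)}
before = locate(vaddr, page_table)

page_table[(2, 5)] = ("n3", 17)              # re-partition: move the whole page
after = locate(vaddr, page_table)
```

Changing one page table entry relocates every object on the page, but the offset within the page is fixed by the OID, which is why re-clustering within pages is not possible without indirections or changing OIDs.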

4.1.2.4 Logical Volume Scheme

None  of  the  above  schemes  is  ideal,  and  so  we  have  also  investigated  the  design  of  an 
alternative  scheme  which  allows  the  re-partitioning  of  data  without  the  need  to  update 
addresses or to incur the cost of an expensive logical to physical mapping in order to
determine the node on which an object is located. In describing the scheme we will
use the term volume set to denote the set of disk volumes across which the objects in
a  class  are  partitioned.  OIDs  have  the  following  structure: 

Class | Logical Volume | Body

The  Class  field  uniquely  identifies  the  class  of  the  object.  As  will  be  described  in 
detail  later,  the  Logical  Volume  field  is  used  to  ensure  that  objects  which  must  be 
clustered  are  always  located  in  the  same  disk  volume,  even  after  re-partitioning.  The 
size  of  this  field  is  chosen  so  that  its  maximum  value  is  much  greater  than  the 
maximum  number  of  physical  volumes  that  might  exist  in  a  parallel  system. 

Each  node  of  the  server  has  access  to  a  definition  of  the  Volume  Set  for  each  class. 
This  allows  each  node,  when  presented  with  an  OID,  to  use  the  Class  field  of  the  OID 
to  determine  the  number  and  location  of  the  physical  volumes  in  the  Volume  Set  of 
the  class  to  which  the  object  belongs. 

If we denote the cardinality of Volume Set V as |V|, then the physical volume on
which the object is stored is calculated cheaply as:

Physical Volume = V[ LogicalVolume mod |V| ]


where  the  notation  V[i]  denotes  the  i’th  Volume  in  the  Volume  Set  V.  A  node 
wishing  to  access  the  object  uses  this  information  to  forward  the  request  to  the  node  on 
which  the  object  is  held.  On  that  node,  the  unique  serial  number  for  the  object  within 
the  physical  volume  can  be  calculated  as: 

Serial  =  (Body  div  |V|)  +  |Body|  *  (Logical Volume  div  |V|) 

where a div b is the whole number part of the division of a by b, and |Body| is the
number  of  unique  values  that  the  body  field  can  take. 

The  Serial  number  can  then  be  mapped  to  a  PID  using  a  local  logical  to  physical 
mapping  scheme. 

If  it  is  necessary  to  re-partition  a  class  then  this  can  be  done  by  creating  a  new 
volume  set  with  a  different  cardinality  and  copying  each  object  from  the  old  volume 
set  into  the  correct  volume  of  the  new  volume  set  using  the  above  formula.  Therefore 
a  class  of  objects  can  be  re-partitioned  over  an  arbitrary  set  of  volumes  without  having 
to  change  the  OIDs  of  those  objects. 

The  main  benefit  of  the  scheme  is  that  objects  in  the  same  class  with  the  same 
Logical  Volume  field  will  always  be  mapped  to  the  same  physical  volume.  This  is  still 
true  even  after  re-partitioning,  and  so  ensures  that  objects  clustered  before  partitioning 
can  still  be  clustered  afterwards.  Therefore,  when  objects  which  must  be  clustered 
together  are  created  they  should  be  given  OIDs  with  the  same  value  in  the  Logical 
Volume  field.  To  ensure  that  the  OID  is  unique  they  must  have  different  values  for 
their  Body  fields. 
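
The clustering property can be checked with a short sketch of the physical volume calculation. The `OID` tuple, the class name "Part" and the volume labels such as "n0/d0" are hypothetical; only the rule Physical Volume = V[ LogicalVolume mod |V| ] is taken from the text.

```python
from typing import NamedTuple


class OID(NamedTuple):
    cls: str
    logical_volume: int
    body: int


def physical_volume(oid, volume_sets):
    """Apply Physical Volume = V[LogicalVolume mod |V|] for the object's class."""
    volume_set = volume_sets[oid.cls]
    return volume_set[oid.logical_volume % len(volume_set)]


a = OID("Part", 8, 0)
b = OID("Part", 8, 1)    # same Logical Volume as a, different Body
c = OID("Part", 9, 0)

before = {"Part": ["n0/d0", "n1/d0"]}             # volume set of cardinality 2
after = {"Part": ["n0/d0", "n1/d0", "n2/d0"]}     # re-partitioned over 3 volumes

placement_before = [physical_volume(o, before) for o in (a, b, c)]
placement_after = [physical_volume(o, after) for o in (a, b, c)]
```

Objects a and b share a physical volume both before and after the re-partition, although that volume changes, and none of the three OIDs has to be rewritten.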

Where  it  is  desirable  to  cluster  objects  of  different  classes  then  this  can  be  achieved 
by  ensuring  that  both  classes  share  the  same  volume  set,  and  that  objects  which  are  to 
be  clustered  share  the  same  Logical  Volume  value.  This  ensures  that  they  are  always 
stored  on  the  same  physical  volume. 

When  an  external  client  application  needs  to  create  a  new  object  then  it  sends  the 
request,  including  the  class  of  the  object,  to  a  Communication  Node.  These  hold 
information  on  all  the  volume  sets  in  the  server,  and  so  can  select  one  physical  volume 
from  that  set  (using  a  round-robin  or  random  algorithm).  The  CN  then  forwards  the 
create  object  request  to  the  node  that  contains  the  chosen  physical  volume,  and  the 
OID  is  then  generated  on  that  node  as  follows.  Each  node  contains  the  definition  of  the 
volume  set  for  each  class.  The  required  values  of  the  Class  and  Logical  Volume  fields 
of  the  object  will  have  been  sent  to  the  node  by  the  CN.  The  node  also  keeps 
information  on  the  free  Body  values  for  each  of  the  logical  volumes  which  map  to  a 
physical  volume  connected  to  that  node.  The  node  can  therefore  choose  a  free  Body 
value  for  the  Logical  Volume  chosen  by  the  CN.  This  completes  the  OID.  The  object 
is  then  allocated  to  a  physical  address  and  the  local  OID  mapping  information  updated 
to  reflect  this  new  OID  to  PID  mapping. 

The  creation  algorithm  is  different  in  cases  where  it  is  desirable  to  create  a  new 
object  which  is  to  be  clustered  with  an  existing  object.  The  new  object  must  be  created 
with  the  same  Logical  Volume  field  setting  as  the  existing  object  so  that  it  will  always 
be stored on the same Physical Volume as the existing object, even after a re-partition.
Therefore  it  is  this  Logical  Volume  value  which  is  used  to  determine  the  node  on 
which  the  object  is  created.  On  that  node  the  Body  field  of  the  new  object  will  be 
selected, and the object will be allocated into physical disk storage clustered with the
existing object (assuming that space is available).

This scheme therefore allows us to define a scaleable object manager in which it is
possible to re-partition a set of objects in order to meet increased performance
requirements. Existing clustering can be preserved over re-partitioning of the data;
however, if re-clustering is required then it would have to be done by either using
indirections  or  by  updating  OIDs.  Determining  the  node  on  which  an  object  resides  is 
cheap  and  this  should  reduce  the  time  required  on  the  CNs  to  process  incoming  object 
access  requests.  The  vast  bulk  of  the  information  required  to  map  an  OID  to  a  PID  is 
local  to  the  node  on  which  the  object  is  stored,  so  reducing  the  total  amount  of  storage 
in  the  system  required  for  OID  mapping,  and  reducing  the  response  time  and 
execution  cost  of  object  creation. 

4. 1.2.5  Summary 

The  above  schemes  all  have  advantages  and  disadvantages: 

• The Physical OID scheme has a low run time cost but our requirements for
re-clustering and re-partitioning can only be met if OIDs can be changed. However,
the  use  of  names  as  permanent  identifiers  does  permit  this. 

•  The  Logical  OID  system  supports  both  re-partitioning  and  re-clustering.  However  it 
is  expensive  both  for  object  creation  and  in  terms  of  the  load  it  places  on  the  CNs. 

•  The  Virtual  Address  scheme  supports  re-partitioning,  but  re-clustering  is  only 
possible  through  indirections  or  by  changing  OIDs.  Creating  an  object  is  costly 
when  a  new  page  has  to  be  allocated  as  it  requires  an  updating  of  the  OID  to  PID 
mapping  on  all  nodes.  Also,  the  cost  of  mapping  OIDs  to  PIDs  for  accesses  from 
external  clients  may  place  a  significant  load  on  the  CNs. 

•  The  Logical  Volume  scheme  may  be  a  reasonable  compromise  between  completely 
logical  and  completely  physical  schemes.  However  indirections  or  changeable 
OIDs  are  required  for  re-clustering. 

We  are  therefore  investigating  these  schemes  further  in  order  to  perform  a  quantitative 
comparison  between  them. 

4.2  Concurrency  Control  and  Cache  Management 

Concurrency  control  is  important  in  any  system  in  which  multiple  clients  can  be 
accessing  an  object  simultaneously.  In  our  system,  each  node  runs  an  Object  Manager 
which  is  responsible  for  the  set  of  local  objects  including:  concurrency  control  when 
those  objects  are  accessed;  and  the  logging  mechanisms  required  to  ensure  that  it  is 
possible to recover the state of those objects after a node or system failure.

In  the  system  we  have  described  there  are  three  types  of  clients  of  the  Object 
Managers:  database  applications  running  on  external  clients;  Query  Execution 
components  on  the  parallel  server  nodes;  and,  also  on  the  nodes,  object  method  calls 
instigated  by  CORBA  based  clients.  Therefore  there  are  caches  both  on  the  nodes  of 
the  server,  and  in  the  external  database  clients.  More  than  one  client  may  require 
access  to  a  particular  object  and  so  concurrency  control  and  cache  coherency 
mechanisms  must  manage  the  fact  that  an  object  may  be  held  in  more  than  one  cache. 
This situation is not unique to parallel systems, and occurs on any client-server object
database  system  which  supports  multiple  clients.  A  number  of  concurrency 
control/cache  coherency  mechanisms  have  been  proposed  for  this  type  of  system  [19], 
and  there  appears  to  be  nothing  special  about  parallel  systems  which  prevents  one  of 
these  from  being  adopted. 


4.3  Query  Execution 

In  a  parallel  system  we  wish  to  be  able  to  improve  the  response  time  of  individual 
OQL  queries  by  exploiting  intra-query  parallelism.  In  this  section  we  discuss  some  of 
the  issues  in  achieving  this. 

An  OQL  query  generated  by  a  client  will  be  received  first  by  a  Communications 
Node  which  will  use  a  load  balancing  scheme  to  choose  a  node  on  which  to  compile 
the  query.  Compilation  generates  an  execution  plan:  a  graph  whose  nodes  are 
operations  on  objects,  and  whose  arcs  represent  the  flow  of  objects  from  the  output  of 
one  operation  to  the  input  of  another.  As  described  earlier,  the  system  we  propose 
supports  data  shipping  but  for  the  reasons  discussed  in  Section  2.2.1,  it  will  give 
performance  benefits  if  operations  are  carried  out  on  local  rather  than  remote  objects 
where  possible.  To  achieve  this  it  is  necessary  to  structure  collections  which  are  used 
to  access  objects  (e.g.  extents)  so  as  to  allow  computations  to  be  parallelised  on  the 
basis of location. One option [17] is to represent collections of objects through two-level
structures. At the top level is a collection of sets of objects with one set per node
of  the  parallel  system.  Each  set  contains  only  the  objects  held  on  one  node.  This 
allows  computations  on  collections  of  objects  to  be  parallelised  in  a  straightforward 
manner;  each  node  will  run  an  operation  which  processes  those  objects  stored  locally. 
However,  in  our  data  shipping  system  if  the  objects  in  a  collection  are  not  evenly 
partitioned  over  the  nodes  then  it  is  still  possible  to  partition  the  work  of  processing 
the  objects  among  the  nodes  in  a  way  which  is  not  based  on  the  locations  of  the 
objects.  One  advantage  of  the  Logical  Volume  OID  mapping  scheme  described  in 
Section 4.1.2.4 is that it removes the need to explicitly maintain these structures for
class  extents  as  the  existing  OID  to  PID  mapping  information  makes  it  possible  to 
locate  all  the  objects  in  a  given  class  stored  on  the  local  node. 
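
The two-level structure described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `TwoLevelExtent` and its methods are hypothetical, and threads stand in for the nodes of the parallel server.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class TwoLevelExtent:
    """Hypothetical two-level collection: node id -> objects held on that node."""

    def __init__(self):
        self.by_node = defaultdict(list)

    def add(self, node_id, obj):
        # The top level selects the node; the second level is that node's local set.
        self.by_node[node_id].append(obj)

    def parallel_select(self, predicate):
        """Run one scan per node over its local objects, then merge the results."""
        node_ids = sorted(self.by_node)
        if not node_ids:
            return []

        def scan(node_id):
            return [o for o in self.by_node[node_id] if predicate(o)]

        # One worker per node; each worker touches only "local" objects.
        with ThreadPoolExecutor(max_workers=len(node_ids)) as pool:
            parts = list(pool.map(scan, node_ids))
        return [o for part in parts for o in part]


extent = TwoLevelExtent()
for i in range(10):
    extent.add(i % 3, i)          # spread ten objects over three nodes
evens = extent.parallel_select(lambda x: x % 2 == 0)
```

Because each scan reads only one node's set, the work is partitioned purely by location; an uneven distribution of objects would show up directly as uneven scan lengths.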

A  major  difference  between  object  and  relational  databases  is  that  object  database 
queries  can  include  method  calls.  This  has  two  major  implications. 

Firstly,  as  the  methods  must  be  executed  on  the  nodes  of  the  parallel  server  a 
mechanism  is  required  to  allow  methods  to  be  compiled  into  executable  code  which 
runs  on  the  server.  This  is  complicated  by  the  fact  that  the  ODMG  standard  allows 
methods  to  be  written  in  a  number  of  languages,  and  so  either  the  server  will  have  to 
provide  environments  to  compile  and  execute  each  of  these  languages,  or  alternatively 
client  applications  using  the  system  will  have  to  be  forced  to  specify  methods  only  in 
those  languages  which  the  server  can  support.  Further,  the  architecture  that  we  have 
proposed  in  this  paper  has  client  applications  executing  method  calls  in  navigational 
code  on  the  client.  Therefore  executable  versions  of  the  code  must  be  available  both  in 
the server (for OQL) and on the clients. This introduces risks that the two versions of
the code will get out of step. This may, for example, occur if a developer modifies and
recompiles  the  code  on  a  client  but  forgets  to  do  the  same  on  the  server.  Also  a 
malicious  user  might  develop,  compile  and  execute  methods  on  the  client  specifically 
so  as  to  access  object  properties  not  available  through  the  methods  installed  in  the 
server.  Standardising  on  Java  or  another  interpreted  language  for  writing  methods 
might  appear  to  be  a  solution  to  these  problems  because  they  have  a  standard,  portable 
executable  representation  to  which  a  method  could  be  compiled  once  and  stored  in  the 
database  before  execution  on  either  a  client  or  server.  However,  the  relatively  poor 
performance  of  interpreted  languages  when  compared  to  a  compiled  language  such  as 
C++  is  a  deterrent  because  the  main  reason  for  utilising  a  parallel  server  is  to  provide 
high  performance. 

Secondly,  the  cost  of  executing  a  method  may  be  a  significant  component  of  the 
overall  performance  of  a  database  workload,  and  may  greatly  vary  in  cost  depending 
on  its  arguments.  This  could  make  it  difficult  to  statically  balance  the  work  of 
executing  a  query  over  a  set  of  nodes.  For  example,  consider  an  extent  of  objects  that 
is  perfectly  partitioned  over  a  set  of  nodes  such  that  each  node  contains  the  same 
number  of  objects.  A  query  is  executed  which  applies  a  method  to  each  object  but  not 
all  method  calls  take  the  same  time,  and  so,  if  the  computation  is  partitioned  on  the 
basis  of  object  locality,  some  nodes  may  have  completed  their  part  of  the  processing 
and  be  idle  while  others  are  still  busy.  Therefore  it  may  be  necessary  to  adopt  dynamic 
load  balancing  schemes  which  delay  decisions  on  where  to  execute  work  for  as  long  as 
possible  (at  run-time)  so  as  to  try  to  evenly  spread  the  work  over  the  available  nodes. 
For  workloads  in  which  method  execution  time  dominates  and  not  all  nodes  are  fully 
utilised  (perhaps  because  the  method  calls  are  on  a  set  of  objects  whose  cardinality  is 
less  than  the  number  of  nodes  in  the  parallel  system)  then  it  may  be  necessary  to 
support  intra-method  parallelism,  in  which  parallelism  is  exploited  within  the 
execution  of  a  single  method.  This  would  require  methods  to  be  written  in  a  language 
which  was  amenable  to  parallelisation,  rather  than  a  serial  language  such  as  C++  or 
Java.  One  promising  candidate  is  UFO  [20],  an  implicit  parallel  language  with  support 
for  objects. 
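
The effect of uneven method costs on static, locality-based partitioning can be illustrated with a small simulation. This is a sketch only: the cost figures are invented, the function names are hypothetical, and the cost of shipping objects between nodes is ignored.

```python
import heapq


def static_makespan(partitions):
    """Each node processes only its local objects; the slowest node sets the finish time."""
    return max(sum(costs) for costs in partitions)


def dynamic_makespan(costs, n_nodes):
    """Whichever node becomes idle first takes the next pending method call."""
    free_at = [0.0] * n_nodes      # min-heap of the times at which nodes become free
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Four nodes, four objects each, but node 0 holds the expensive method calls.
partitions = [[10, 10, 10, 10], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
all_costs = [c for part in partitions for c in part]

print(static_makespan(partitions))     # 40
print(dynamic_makespan(all_costs, 4))  # 13.0
```

In this example the extent is perfectly partitioned by object count, yet three nodes sit idle for most of the static run; deferring the placement decision to run time shortens the overall finish time from 40 to 13 units.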


4.4  Performance  Management 

Section  3  outlined  the  requirements  for  performance  management  in  a  parallel 
database  server.  In  this  section  we  describe  how  we  intend  to  meet  these  requirements. 

In  recent  years,  there  has  been  an  interest  in  the  performance  management  of 
multimedia  systems  [21].  When  a  client  wishes  to  establish  a  session  with  a 
multimedia  server,  the  client  specifies  the  required  Quality  of  Service  and  the  server 
decides  if  it  can  meet  this  requirement,  given  its  own  performance  characteristics  and 
current  workload.  If  it  can,  then  it  schedules  its  own  CPU  and  disk  resources  to  ensure 
that  the  requirements  are  met.  Ideally,  we  would  like  this  type  of  solution  for  a  parallel 
database  server,  but  unfortunately  it  is  significantly  more  difficult  due  to  the  greater 
complexity  of  database  workloads.  The  Quality  of  Service  requirements  of  the  client 
of a multimedia server can be expressed in simple terms (e.g. throughput and jitter),
and the mapping of these requirements onto the usages of the components of the
server  is  tractable.  In  contrast,  whilst  the  requirements  of  a  database  workload  may  be 
specified  simply,  e.g.  response  time  and  throughput,  in  many  cases  the  mapping  of  the 
workload  onto  the  usage  of  the  resources  of  the  system  cannot  be  easily  predicted.  For 
example,  the  CPU  and  disk  utilisation  of  a  complex  query  are  likely  to  depend  on  the 
state  of  the  data  on  which  the  query  operates. 

We  have  therefore  decided  to  investigate  a  more  pragmatic  approach  to  meeting  the 
requirements  for  performance  management  in  a  parallel  database  server  which  is 
based  on  priorities  [11].  When  an  application  running  on  an  external  client  connects  to 
the  database  then  that  database  session  will  be  assigned  a  priority  by  the  server.  For 
example,  considering  the  examples  given  in  Section  3,  the  sessions  of  clients  in  an 
OLTP  workload  will  have  high  priority,  the  sessions  of  clients  generating  complex 
queries  will  have  medium  priority  while  the  sessions  of  agents  will  have  low  priority. 
Each  unit  of  work  -  an  object  access  or  an  OQL  query  -  sent  to  the  server  from  a  client 
will  be  tagged  with  the  priority  and  this  will  be  used  by  the  disk  and  CPU  schedulers. 
If  there  are  multiple  units  of  work,  either  object  accesses  or  fragments  of  OQL 
available  for  execution,  then  they  will  be  serviced  in  the  order  of  their  priority,  while 
pre-emptive  round-robin  scheduling  will  be  used  for  work  of  the  same  priority. 
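
A minimal sketch of this scheduling policy is given below. The class name, the three priority levels and the unit time quantum are assumptions made for illustration; the paper does not specify them.

```python
from collections import deque


class PriorityScheduler:
    """Strict priority between levels, round-robin within a level (illustrative only)."""

    def __init__(self, levels=3):
        # Index 0 is the highest priority (e.g. OLTP), the last index the lowest (agents).
        self.queues = [deque() for _ in range(levels)]

    def submit(self, priority, name, work_units):
        self.queues[priority].append([name, work_units])

    def run(self, quantum=1):
        """Return the sequence of unit names in the order they receive a time slice."""
        trace = []
        while True:
            # Always serve the highest-priority non-empty queue.
            queue = next((q for q in self.queues if q), None)
            if queue is None:
                return trace
            unit = queue.popleft()
            trace.append(unit[0])
            unit[1] -= quantum
            if unit[1] > 0:
                queue.append(unit)   # pre-empted: rejoin the back of its own queue


sched = PriorityScheduler()
sched.submit(2, "agent", 1)
sched.submit(0, "oltp-a", 2)
sched.submit(0, "oltp-b", 2)
sched.submit(1, "query", 1)
order = sched.run()
# order == ['oltp-a', 'oltp-b', 'oltp-a', 'oltp-b', 'query', 'agent']
```

The two high-priority units alternate until both finish, and only then do the medium and low priority units run, which also shows why priorities alone cannot guarantee a response time for lower-priority work.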


5  Conclusions 


The  aim  of  the  Polar  project  is  to  investigate  the  design  of  a  parallel,  ODMG 
compatible  ODBMS.  In  this  paper  we  have  highlighted  the  system  requirements  and 
key  design  issues,  using  as  a  starting  point  our  previous  experience  in  the  design  and 
usage  of  parallel  RDBMS.  We  have  shown  that  differences  between  the  two  types  of 
database  paradigms  lead  to  a  number  of  significant  differences  in  the  design  of  parallel 
servers.  The  main  differences  are: 

•  rows  in  RDBMS  tables  are  never  accessed  individually  by  applications,  whereas 
objects  in  an  ODBMS  can  be  accessed  individually  by  their  unique  OIDs.  The 
choice  of  OID  to  PID  mapping  is  therefore  very  important.  A  number  of  schemes 
for structuring OIDs and mapping them to PIDs were presented and compared.
None  is  ideal,  and  further  work  is  needed  to  quantify  the  differences  between  them. 

•  there  are  two  ways  to  access  data  held  in  an  object  database:  through  a  query 
language  (OQL),  and  by  directly  mapping  database  objects  into  client  application 
program  objects.  In  a  relational  database,  the  query  language  (SQL)  is  the  only  way 
to  access  data.  A  major  consequence  of  this  difference  is  that  parallel  RDBMS  can 
utilise  either  task  shipping  or  data  shipping,  but  task  shipping  is  not  a  viable  option 
for  a  parallel  ODBMS  with  external  clients. 

•  object  database  queries  written  in  OQL  can  contain  arbitrary  methods  (unlike 
RDBMS  queries)  written  in  one  of  several  high  level  programming  languages. 
Mechanisms  are  therefore  required  in  a  parallel  ODBMS  to  make  the  code  of  a 
method  executable  on  the  nodes  of  the  parallel  server.  This  will  complicate  the 
process of replacing an existing serial ODBMS, in which methods are only
executed on clients, with a parallel server. It also raises potential issues of security
and  the  need  to  co-ordinate  the  introduction  of  method  software  releases.  Further, 
as  methods  can  contain  arbitrary  code,  estimating  execution  costs  is  difficult  and 
dynamic  load-balancing,  to  spread  work  evenly  over  the  set  of  nodes,  may  be 
required. 

We  have  also  highlighted  the  potential  role  for  parallel  ODBMS  in  distributed  systems 
and  described  the  corresponding  features  that  we  believe  will  be  required.  These 
include  performance  management  to  share  the  system  resources  in  an  appropriate 
manner,  and  the  ability  to  accept  and  load-balance  requests  from  CORBA  clients  for 
method  execution  on  objects. 


5.1  Future  Work 

Having  carried  out  the  initial  investigations  described  in  this  paper  we  are  now  in  a 
position  to  explore  the  design  options  in  more  detail  through  simulation  and  the 
building  of  a  prototype  system.  We  are  building  the  prototype  in  a  portable  manner  so 
that, as described in Section 3, we can compare the relative merits of a custom parallel
machine  and  one  constructed  entirely  from  commodity  hardware. 

Our  investigations  so  far  have  also  raised  a  number  of  areas  for  further  exploration, 
and  these  may  influence  our  design  in  the  longer  term: 

•  More  sophisticated  methods  of  controlling  resource  usage  in  the  parallel  server  are 
needed  so  as  to  make  it  possible  to  guarantee  to  meet  the  performance  requirements 
of  a  workload.  The  solution  based  on  priorities,  described  in  Section  3,  does  not 
give  as  much  control  over  the  scheduling  of  resources  as  is  required  to  be  able  to 
guarantee  that  the  individual  performance  requirements  of  a  set  of  workloads  will 
all  be  met.  Therefore  we  are  pursuing  other  options  based  on  building  models  of  the 
behaviour  of  the  parallel  server  and  the  workloads  which  run  on  it.  Each  workload 
is given the proportion of the system's resources that it needs to meet its
performance  targets. 

•  If  it  is  possible  to  control  the  resource  usage  of  database  workloads  then,  in  the 
longer  term,  it  would  be  desirable  to  extend  the  system  to  support  continuous 
media.  Currently  multimedia  servers  (which  support  continuous  media)  and  object 
database  servers  have  different  architectures  and  functionality,  but  there  would  be 
many  advantages  in  unifying  them,  for  example  to  support  the  high-performance, 
content-based  searching  of  multimedia  libraries. 

•  The  ODMG  standard  does  not  define  methods  of  protecting  access  to  objects 
equivalent  to  that  found  in  RDBMS.  Without  such  a  scheme,  there  is  a  danger  that 
clients  will  be  able  to  access  and  update  information  which  should  not  be  available 
to  them.  This  will  be  especially  a  problem  if  the  database  is  widely  accessible,  for 
example  via  the  Internet.  Work  is  needed  in  this  area  to  make  information  more 
secure  from  both  malicious  attack  and  programmer  error. 
Abstract. Parallel processing on OODBMS (Object Oriented Database
Management Systems) may improve performance for non-conventional
applications that manipulate large volumes of data. This work analyses and
develops techniques that contribute to the improvement of query processing
with  shared-nothing  (SN)  parallel  OODBMS.  A  solution  for  inter-node 
communication  is  developed  where  the  effects  of  communication  are 
minimised  through  the  use  of  message  queues,  reduction  of  the  message  size 
and  the  creation  of  specific  processes  in  each  node. 


1  Introduction 

The object-oriented model is becoming very popular for database systems due to its
structural and behavioural abstraction facilities. Data intensive applications with
complex modelling requirements such as engineering, medicine and geography are
natural candidates for object-oriented database systems (OODBMS). Those
applications also impose high performance requirements; parallel
processing in OODBMS therefore offers a promising way to manage data in these new
domains efficiently. Object-oriented query processing provides good opportunities to
exploit parallel processing. However, the richer modelling capabilities of the
OODBMS make it difficult to devise an effective strategy for distributing objects
across multiple disks. While in the relational model the database system operates
with sets of tuples, in the OO model the database system has to operate with
individual instances as well as with sets of objects. Also, object clustering on disk
does not follow class boundaries: related sets of objects from different classes may
be placed on the same disk page. Combining object-clustering algorithms with object
partitioning is not a trivial task. Object partitioning is particularly important in
shared-nothing (SN) database systems because object placement is static. SN
database systems [10], with their promise of scalability and availability, have evolved
as an answer to these new database applications. Therefore, a solution for parallel
object database systems on shared-nothing architectures has to combine a good
distributed design with special query processing techniques to reduce the
communication overheads.
This work presents experimental object query processing results with the ParGoa
parallel object server. We analyse and develop techniques that contribute to the
improvement of query processing with shared-nothing (SN) systems. The ParGoa
server is responsible for the parallel processing of the Goa++ OODBMS [9]. The
Goa++ system is ODMG compliant and its ODL and OQL facilities are encapsulated
with an ORB [15] (Object Request Broker) interface from the CORBA standard.

Experiments were made on a network of workstations configured as an SN parallel
virtual machine with PVM software [8]. The performance of the prototype was
evaluated in situations where there were no inter-node references and in situations
where this kind of communication was necessary. Although parallel query processing has been
explored extensively in the relational model, there are few experiments with parallel
OO database systems or prototypes [2, 7, 13]. Shore [6] is an exception and this work
adopts similar solutions to the Shore system. However, while [6] analyses the
effects of traversals without object transfer between nodes, this work focuses on query
processing with and without communication costs. The queries involve scanning the
class  extensions  as  well  as  traversals  on  different  collections  of  objects.  The  goal  of 
this  paper  therefore  is  to  present  a  solution  for  inter-node  object  exchange  where  the 
effects  of  communication  costs  are  minimised.  The  proposed  solution  involves  the 
use  of  message  queues,  reduction  of  the  message  size  and  the  creation  of  specific 
processes  in  each  node  for  handling  object  transfers  and  local  query  processing. 

This work is organised as follows. Section 2 introduces the Goa++ system while
Section  3  presents  the  main  features  of  the  ParGoa  server.  Solutions  for  handling 
message  passing  in  parallel  object  query  processing  are  presented  in  Section  4.  The 
object  base  where  the  experiments  took  place  is  addressed  in  Section  5.  Performance 
results  are  shown  and  discussed  in  Section  6.  Finally,  Section  7  contains  the 
conclusions. 

2 The Goa++ System

The Goa++ object database system [9] works with a client/server architecture (Fig.
1). The client contains the application while the server executes the database
persistence services and parallel query processing as well as other parallel set
operations. The Goa++ client requests objects from the Goa++ server. The transfer unit
between the client and server is the object (on collections) and not the page.


Fig. 1. GOA++ Server and JAVA Application


The Goa++ data model is compliant with the ODMG standard; therefore ODL is used to
create a database schema and OQL is used to query stored collections of objects.
Goa++ is still a prototype and is currently single-user; it therefore lacks concurrency
control. However, it presents some advanced features such as parallel processing,
object  distribution  and  metadata  management. 

The application programming interface (API) to Goa++ (Fig. 2) is a library of
functions  callable  from  C++  or  Java  applications.  The  dotted  lines  indicate  modules 
under  construction.  OQL  can  be  used  as  an  embedded  function  in  a  programming 
language  or  as  an  “ad-hoc”  query  language. 


Fig.  2.  Application  programming  interface  to  GOA 


The Goa++ Server has 4 levels of major classes that offer object
management services: ODMG language compiler, Schema Manager, Query Manager, and
Object Storage Manager. We call the combination of the Query and Object Storage
Managers the Goa Manager.

When  Goa++  works  with  parallel  query  processing,  the  Goa  Manager  is  replicated 
in  several  processing  nodes.  Also  a  new  level  of  service  is  added  above  Goa  Manager 
to  co-ordinate  the  parallel  execution.  This  server  configuration  is  called  ParGoa 
Server and is discussed in the next section.

3 The ParGoa Server

The  ParGoa  server  is  responsible  for  parallel  processing  on  the  Goa++  server. 
ParGoa  services  may  be  issued  from  the  Goa++  client  or  different  systems  through 
the API or the ORB interface. A previous implementation of the ParGoa server
explored the shared disk (SD) architecture. However, the shared disk access bottleneck led
to the development of the SN parallel version, ParGoa-SN. In the SN system each
processor works with less disk data and may benefit from parallel disk access.
Since the parallelism is achieved by the ParGoa server, this work presents the
parallel query processing within the ParGoa-SN server (Fig. 3). When a client
submits a query, the ParGoa server sequentially translates the
commands into an intermediate code. This code is analysed by the ParGoa Scheduler,
which generates the parallel execution plan of the query. After this phase, the query is
executed in parallel without the interference of the Scheduler.


Fig. 3. The ParGoa-SN server


Once the schema is created, the database administrator may define a
fragmentation policy for the ParGoa server. The Schema API does not handle this
service because fragmentation is not supported by the ODMG specification. Therefore
the fragmentation schema is sent directly to the ParGoa Scheduler. When
the user then submits an OQL query, its execution is parallelized transparently by the
ParGoa Scheduler. This version of the ParGoa query processor does not create new
objects or complex values as an answer, although ODMG specifies this. The parallel query
always returns a set of object identifiers which may be requested and handled by the
client.

The Scheduler co-ordinates the execution of each ParGoa server operation, such as
queries, set operations and storage management. These operations are executed
sequentially on each GOA node under the Scheduler's co-ordination. Each GOA
processing node is composed of a CPU, memory, and one disk drive. In this
experiment the nodes use the Ethernet interconnection network for all
communication. Each GOA node has a page cache of 4 Mbytes.

The  ParGoa-SN  provides  data  parallelism  by  processing  fragments  of  object  sets  in 
parallel.  These  fragments  are  managed  by  the  Fragmentation  Manager  which  allows 
primary  and  derived  horizontal  class  fragmentation.  The  application  can  be  given 
access  to  the  entire  distributed  persistent  object  space  since  the  ParGoa-SN  provides 
location  and  fragmentation  levels  of  transparency.  The  fragmentation  and  placement 
of  objects  during  the  database  workload  uses  full  declustering. 

3.1  ParGOA  Scheduler 

The ParGoa Scheduler is responsible for the allocation of new objects and for the
co-ordination of the parallel query processing executed on each node of the
fragmented database. It communicates with the Fragmentation Manager and with the
query processes that run on the Goa nodes. The Scheduler receives the query and
sends it to the Fragmentation Manager. After receiving the identification of the
nodes where the query must be processed, it sends the local queries to the specific
Goa nodes where they are executed. It should be noted that the Scheduler and the
Fragmentation Manager run on a single node, so a system bottleneck may occur only
when the queries arrive at this node. After this, the parallel execution begins and
there are no more interactions with the Scheduler until each node finishes executing
its query.

3.2  Fragmentation  Manager 

The Fragmentation Manager is responsible for determining the nodes involved in the
query execution. It uses the information stored in the fragmentation schema, which
describes how the classes are fragmented among the nodes. For each class, this schema
records the type of its fragmentation (primary or derived) and the attribute with the
corresponding ranges used to distribute the objects. To identify the nodes to which the
ParGoa Scheduler must send the query, the Fragmentation Manager analyses the
query predicate against the fragmentation schema.
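
The node selection step can be sketched as follows. The schema layout, the range boundaries and the function name are hypothetical; only primary range fragmentation and predicates of the form `attribute < value` are modelled, and derived fragmentation is left out.

```python
# Hypothetical fragmentation schema: class -> (attribute, [(low, high, node), ...]).
# Ranges are half-open [low, high); the boundary values are illustrative only.
SCHEMA = {
    "Atomic": ("date", [(0, 1000, 0), (1000, 2000, 1), (2000, 3000, 2)]),
}


def nodes_for_query(class_name, attribute, upper_bound, schema=SCHEMA):
    """Return the nodes that may hold objects satisfying `attribute < upper_bound`."""
    frag_attr, ranges = schema[class_name]
    if attribute != frag_attr:
        # The predicate does not use the fragmentation attribute: scan every fragment.
        return sorted(node for _, _, node in ranges)
    # Keep only fragments whose range starts below the bound.
    return sorted(node for low, _, node in ranges if low < upper_bound)


print(nodes_for_query("Atomic", "date", 1011))   # [0, 1]
print(nodes_for_query("Atomic", "x", 5))         # [0, 1, 2]
```

With these assumed ranges, a predicate such as `date < 1011` is sent to two of the three nodes, whereas a predicate on any other attribute has to be evaluated on all of them.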

4  Parallel  Query  Processing 

When a query is submitted to ParGoa, the ParGoa Scheduler interacts with the
Fragmentation Manager to see if the query may take advantage of the
fragmentation strategy. When the fragmentation attribute is not involved in the query
predicate, the Scheduler co-ordinates the query execution on all nodes; otherwise the
query may be directed to specific fragments, i.e., nodes. The Scheduler sends query
predicates to be evaluated on each Goa node involved in the query execution. Query
predicates may be classified in two categories: (1) simple predicates, where the query
involves only one class extension, and (2) complex predicates, where the query involves
navigation (traversals) through objects of different classes. During complex predicate
evaluation, referenced objects might reside on a different
node from the predicate evaluation node. To handle this situation, two processes
were designed to execute on each node (Fig. 4).

The query process is responsible for the evaluation of the predicate and for the
local query execution, returning to the ParGoa Scheduler the list of the objects that satisfy
the operation. The object process caches and sends the objects requested by
other nodes. Since we use physical OIDs that include the node identification, the
query process knows to which fragment it must send the message requesting an object
needed in the predicate evaluation. Three approaches for inter-node
communication were implemented. In the first one, the query process identifies that
the object needed in the query predicate is placed on another node and immediately
requests the object to be sent. This solution proved to be unsatisfactory because of
the large number of messages exchanged among the nodes. To reduce the
message traffic between nodes, a second strategy was developed.
The query process queues all the messages that request objects from other nodes until
the query processing can no longer proceed. The object process then receives a list of
local object identifiers to be read from the disk and delivered to the requesting
node.


Fig. 4. Message handling within query processing


Although this second strategy led to a significant performance improvement, a third
solution was implemented. The object process, instead of delivering in full all objects
requested by other nodes, delivers only the attributes of the objects needed for the
query evaluation, thereby reducing not only the number of messages exchanged
among nodes but also the size of these messages.
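
The difference between the three strategies can be made concrete by counting request messages and reply payload for a list of remote references. This is a rough sketch under several assumptions: the function names and byte sizes are invented, each remote reference is a `(node, oid)` pair taken from a physical OID, and the batched strategy is assumed to flush its queue exactly once.

```python
def immediate_requests(refs):
    """Strategy 1: one request message per remote object, sent as soon as it is needed."""
    return len(refs)


def batched_requests(refs):
    """Strategy 2: requests are queued and sent as one list of OIDs per remote node."""
    return len({node for node, _ in refs})


def reply_payload(refs, object_size, attribute_size, attributes_only):
    """Strategy 3 ships only the needed attribute rather than the whole object."""
    per_item = attribute_size if attributes_only else object_size
    return len(refs) * per_item


# Remote references met while evaluating a complex predicate: (node, oid) pairs.
refs = [(1, 10), (1, 11), (2, 20), (1, 12)]

print(immediate_requests(refs))                 # 4
print(batched_requests(refs))                   # 2
print(reply_payload(refs, 200, 8, False))       # 800
print(reply_payload(refs, 200, 8, True))        # 32
```

Under these assumptions batching cuts the number of request messages to one per remote node, while attribute-only delivery leaves the message count unchanged and shrinks the data carried by each reply.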

5  The  Partitioned  Object  Base 

During the experiments the OO7 benchmark [1] was explored and this work presents
just the execution over the interrelated classes shown in Figure 5. A few adaptations
were made to the original data schema because some characteristics are not supported
by the system and are outside the scope of the tests' objectives.


Fig. 5. OO7 adapted schema


The goal of the benchmark is to test many aspects of ODBMS performance, rather
than to model a specific application. We used the number of objects as defined for
the OO7 medium size database, so the number of composite parts was set to 500 in
our experiments. Each composite part is associated with a document object
by a bi-directional link and has an associated graph of atomic parts containing 200
objects. One atomic part in each composite part graph is designated the "root part".
Each atomic part is connected via a bi-directional association to three other atomic
parts. The connections between atomic parts are implemented by interposing a
connection object between each pair of atomic parts. Figures 6 and 7 represent two
different fragmentation strategies used in the evaluation of ParGoa. The first base
partitioning uses primary fragmentation on the Document class and fragments all other
classes in a derived way. Therefore the emphasis is on the relationships between classes,
and all related objects are clustered in the same fragment. Thus in the partitioning strategy of
Figure 6 there is no inter-node communication because the relationships are
self-contained.


Fig.  6.  Fragmentation  without  communication 

However, since this is a special situation where objects are not shared between
different nodes and load imbalance is likely to occur, another fragmentation policy
was implemented.


Fig. 7. Fragmentation with communication


In the second fragmentation strategy, the distribution of the objects among the
nodes was changed so that communication between the nodes is needed to
follow the relationships, as shown in Figure 7 by the darker lines. Although this
situation is not favourable to SN database systems, it is more realistic than the former
fragmentation. The goal of the fragmentation with communication was to measure
the impact of this type of placement policy on traversal queries. The four classes
were fragmented as follows: the Composite and Atomic classes were primarily
fragmented on the date attribute, while Document and Connection were derived from
Composite and Atomic respectively. It should be noted that in the first fragmentation
design the fragments may be processed independently of each other. However,
while this situation is highly favourable to the SN model, it may not be realistic
without replication. Thus, Figure 7 represents a more realistic situation where the
fragmentation and placement strategy enables direct execution on all the primary
fragmented classes.

In each node, Goa uses collections to store the objects that will have associative
access. A class of collection type manages a list of the OIDs of the objects belonging
to the collection. Objects can be clustered on disk by object composition, object
type or object reference. Therefore a fragmented collection will have a subset of
its OIDs in each node, with the corresponding objects also stored on the local disk.

Four queries (Table 1) were evaluated in the experiments. Query 1 is part of the
OO7 specification, while Queries 2, 3 and 4 correspond to specified traversals
embedded in queries. Query 1 has a simple predicate, whereas Queries 2, 3 and 4
have complex predicates.


Queries   Description
Query 1   Select Atomic where date < 1011
Query 2   Select Atomic where Composite.Document.date < 1200
Query 3   Select Composite where Atomic.rootpart.date < 1011
Query 4   Select Connections where Atomic.date < 1011

Table 1. Implemented Queries


In Query 1 the Atomics can be scanned independently by every node over its own
Atomic fragment. Such simple queries are an embarrassingly parallel operation.
On the other hand, the complex predicates may involve crossing node boundaries,
so the two fragmentation strategies may behave differently. To take
advantage of the distributed design, some queries coincide with the primary
fragmentation and therefore may be directed to specific nodes. The four queries
are described as follows:

Query 1 => sequential scan query that accesses every Atomic, selecting those
satisfying a determined range.

Query 2 => navigational query traversing three classes, selecting the Atomics that
are part of a Composite whose document satisfies a determined range.

Query 3 => navigational query traversing two classes, selecting the Composites
whose Atomic root part satisfies a determined range.

Query 4 => navigational query traversing two classes with large cardinalities,
selecting the connections that connect atomic parts in the link whose date
satisfies a determined range.
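
As a rough illustration of how a query that coincides with the primary fragmentation can be directed to specific nodes, the following sketch range-partitions objects on a date attribute and selects the nodes that may hold objects with `date < bound`. The boundary values and the function names are hypothetical and are not taken from ParGoa.

```python
from bisect import bisect_left, bisect_right

# Hypothetical range fragmentation on the date attribute:
# node 0 holds dates < 1000, node 1 holds [1000, 1500), node 2 holds >= 1500.
BOUNDARIES = [1000, 1500]

def node_for(date):
    """Return the node that stores an object with this date."""
    return bisect_right(BOUNDARIES, date)

def nodes_for_less_than(bound):
    """Return the nodes that may hold objects satisfying date < bound."""
    # A boundary equal to `bound` starts a fragment with no qualifying dates,
    # so bisect_left gives the index of the last node worth visiting.
    last = bisect_left(BOUNDARIES, bound)
    return list(range(last + 1))

print(nodes_for_less_than(1011))
```

With these assumed boundaries, a predicate such as `date < 1011` only needs the first two nodes, whereas a predicate on a non-fragmentation attribute would have to be sent to every node.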


6  Performance  Results 

We implemented ParGoa-SN and performed our tests on a cluster of IBM
RS/6000 (PowerPC) workstations connected by Ethernet. Each workstation had 32 MB of
main memory. The workstations in the cluster were not isolated, and since we did
not have exclusive access to them we did not kill the usual suite of
daemons and background processes. However, we did ensure that there were no
active users on the workstations when the tests were run.

The time measured was the elapsed (wall clock) time, which is the time that the
user waits for the problem solution. This kind of measure accounts not only for the
computational work, but also for any waiting for locks, paging and I/O [4]. The
results show performance speedup for hot situations, in which the cache was not
empty. To reduce the interference effects due to not having isolated workstations, we
re-ran each query 20 times.
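
A measurement procedure of this kind can be sketched as follows. This is only an illustration: the function name `time_query`, the untimed warm-up call and the choice of reporting the mean and the variance are assumptions, since the text does not state how the 20 runs were aggregated.

```python
import statistics
import time

def time_query(run_query, repetitions=20):
    """Return (mean, variance) of the elapsed time of `run_query` in seconds."""
    run_query()  # untimed warm-up so the timed runs start with a hot cache (assumption)
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()   # elapsed time, so waiting and I/O are included
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.pvariance(samples)

# Toy stand-in for a query; a real test would submit the query to the scheduler.
mean_s, var_s = time_query(lambda: sum(range(10000)), repetitions=5)
```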

The execution time is measured from the moment the ParGoa Scheduler sends the
query to all the slave nodes involved in the query until the Scheduler has
received results from all those slave nodes. It should be noted that the results showed
low variance; the elapsed time for each query is presented in [11].

These results reflect the two fragmentation strategies presented in Figures 6 and 7
with the medium size OO7 database. We begin by showing the query results for the
fragmentation that involves no communication (Fig. 6) and then we present the
corresponding results for the second strategy.


Fig. 8. Query 1 with no comm.  Fig. 9. Query 2 with no comm.

The results for Queries 1 and 2 in the first fragmentation design are presented in
Figures 8 and 9. These results show significant speedup, as expected. The
super-linear speedup is due to the additional memory obtained with the additional
nodes. However, since the cache was limited to 4 Mbytes, the accessed objects could
not fit in the available memory even with six nodes. The Query 2 execution
showed better results because the query could be directed to execute on only one
node.

Figures 10 and 11 show the results for Queries 3 and 4. As mentioned before,
Query 3 selects the composite parts whose root part satisfies a condition,
while Query 4 selects the connections that connect atomic parts whose date satisfies a
condition.

Both of these navigational queries were executed on all nodes of the parallel
environment. Although it performed well in parallel executions,
Query 3 did not show linear speedup because of the random access performed to
the root part. Query 4 presented better results as more processors were added to the
experiment.


Figures 12 and 13 present the results for Queries 1 and 2 with the second
fragmentation design, which involves communication to follow the inter-node object
references shown in Figure 7. The execution of Query 1 presented the same results
as in the situation without communication: since this query accesses only one class
extension, no communication was necessary in either situation.

The execution of Query 2 involved all nodes and communication between them to
follow the object references. However, the performance results are still close to
linear speedup. This efficiency results from the two-process architecture, which
reduced the communication traffic.

The individual instance access typical of object-oriented systems imposes more
traffic on the communication network. Previous experiments with these same queries,
with object communication but without the message queue and the object process,
presented no speedup at all, which shows the impact of communication in
object-oriented environments.


Fig.  12.  Query  1  with  comm.  Fig.  13.  Query  2  with  comm. 

Query 3 (Fig. 14) was also executed on all nodes of the system in the second
fragmentation design. However, this second placement strategy forces the nodes to
exchange messages during the navigational query. Query 4 (Fig. 15), under this
new design, has its execution directed to only one node. As can be observed
in Figure 14, Query 3 presented better results in this situation than in the first one.
Although the query involves all nodes and communication between them to follow the
object references in the navigational predicate, in this strategy the query processor
only follows the references after scanning the whole local collection. Therefore it
deals with the different collections separately through its object process. When
Query 3 was re-run with two processes in each node but with the no-communication
design, the speedup obtained was higher than with the fragmentation with
communication.


Fig. 14. Query 3 with comm.  Fig. 15. Query 4 with comm.

The results for Query 4 (Fig. 15) also showed better speedup than in the first
situation, since in this second placement strategy the query was executed on
only one node. The same behavior was observed for Query 2 in the first
design, where the fragmentation manager could take advantage of the
fragmentation strategy and direct the query to be executed on only one node. These
results make evident the impact of the distribution design on the performance of
navigational queries in SN parallel OODBMS.

7  Conclusion 

Parallel processing in OODBMS may improve performance for non-conventional
applications that manipulate large volumes of data. This work analyses and develops
techniques which contribute to the improvement of query processing in
shared-nothing (SN) parallel OODBMS. A strategy to resolve external references for the
evaluation of query predicates was implemented. The experiments were carried out on a
network of workstations configured as a parallel virtual machine with PVM software.
The performance of the prototype was evaluated in situations where there were no
inter-node references and in situations where this kind of communication was
necessary. The application used in the experiments presented potential speedup even
in situations where the communication was intensive. This work presents techniques
for inter-node communication in which the effects of communication are minimised
through the use of message queues, reduction of the message size and the creation of
specific processes in each node. The results also stressed the benefits of the
primary and derived fragmentation strategy for distributing objects. Finally, the
performance results in this work show how performance may be improved and
handled in shared-nothing systems with object orientation.

Abstract. In this work we present the vectorization of a new complex
numerical algorithm to simulate the lubricant behaviour in an industrial
device arising in tribology. This real technological problem leads to the
mathematical model of the thin film displacement of a fluid between a
rigid plane and an elastic and loaded sphere. The mathematical study
and a numerical algorithm to solve the model have been proposed in the
previous work [9]. This numerical algorithm mainly combines fixed point
techniques, finite elements and duality methods. Nevertheless, in order
to obtain a more accurate approximation of different real magnitudes, it is
of interest to be able to handle finer meshes, which increase storage and
computation costs. So, in order to increase the performance of the
numerical algorithm in terms of execution time, in this work we mainly
apply vectorization techniques and present some preliminary partial
results from the design of a parallel version of the algorithm. Several test
examples corresponding to different real data sets are presented to
illustrate the advantages of high performance computing.


1  The  Industrial  Problem  in  Tribology 

In a wide range of lubricated industrial devices studied in Tribology the main
task is the determination of the fluid pressure distribution and the gap between
the elastic surfaces which correspond to a given imposed load [4]. Most of these
devices can be represented by a ball-plane geometry (see Fig. 1). So, in order
to perform a realistic numerical simulation of the device, an appropriate
mathematical model must be considered. From the mathematical point of view, the
lubricant pressure is governed by the well-known Reynolds equation [4]. In the
case of elastic surfaces, the computation of the lubricant pressure is coupled with
the determination of the gap. Thus, in the Reynolds equation the gap depends on
the pressure. In the particular ball-bearing geometry, the local nature of the
contact allows Hertz contact theory to be used to express this gap-pressure
dependence. The inclusion of cavitation (the presence of air bubbles) and piezoviscosity
(pressure-viscosity dependence) phenomena is modelled by a more complex set
of equations, see [5]. Moreover, the balance between imposed and hydrodynamic
loads is formulated as a nonlocal constraint on the fluid pressure; see [9] for the
details of the mathematical formulation.


(a) Ball-bearing geometry.  (b) Ball-bearing two-dimensional domain.

Fig. 1. Ball-bearing device.


In  Sections  2  and  3  we  briefly  describe  the  model  problem  and  the  numerical 
algorithm  presented  in  [9],  respectively.  In  Section  4  we  make  reference  to  the 
vectorization  and  parallelization  techniques  that  we  have  applied  in  order  to 
improve  the  performance  of  the  algorithm.  In  Section  5  we  present  the  numerical 
results  and  the  execution  times  corresponding  to  several  different  meshes.  Finally, 
in  Section  6  we  present  the  conclusions  we  have  come  to  as  a  result  of  our  work. 


2  The  Model  Problem 

As a previous step to describing the numerical algorithm, we introduce the
notation and the equations of the mathematical model.

Thus, let $\Omega$ be given by $\Omega = (-M_1, M_2) \times (-N, N)$, with $M_1$, $M_2$, $N$
positive constants, which represents a small neighbourhood of the contact point.
Let $\partial\Omega$ be divided in two parts: the supply boundary
$\Gamma_0 = \{(x,y) \in \partial\Omega \,/\, x = -M_1\}$ and the boundary at atmospheric
pressure $\Gamma = \partial\Omega \setminus \Gamma_0$.

In order to consider the cavitation phenomenon, we introduce a new unknown
$\theta$ which represents the saturation of fluid ($\theta = 1$ in the fluid region where
$p > 0$ and $0 \le \theta \le 1$ in the cavitation region where $p = 0$). The mathematical
formulation of the problem consists of the set of nonlinear partial differential
equations (see [2] and [6] for details) verified by $(p, \theta)$:

$$\frac{\partial}{\partial x}\left(\frac{\rho}{\nu} h^3 \frac{\partial p}{\partial x}\right) + \frac{\partial}{\partial y}\left(\frac{\rho}{\nu} h^3 \frac{\partial p}{\partial y}\right) = 12 s \frac{\partial}{\partial x}(\rho h), \quad p > 0, \quad \theta = 1 \quad \text{in } \Omega^+ \tag{1}$$

$$\frac{\partial}{\partial x}(\rho \theta h) = 0, \quad p = 0, \quad 0 \le \theta \le 1 \quad \text{in } \Omega_0 \tag{2}$$

$$\frac{h^3}{\nu} \frac{\partial p}{\partial n} = 12 s (1 - \theta) h \cos(n, \imath), \quad p = 0 \quad \text{in } \Sigma \tag{3}$$

where the lubricated region, the cavitated region and the free boundary are

$$\Omega^+ = \{(x,y) \in \Omega \,/\, p(x,y) > 0\}$$

$$\Omega_0 = \{(x,y) \in \Omega \,/\, p(x,y) = 0\}$$

$$\Sigma = \partial\Omega^+ \cap \Omega$$

and where $p$ is the pressure, $h$ the gap, $(s, 0)$ the velocity field, $\nu$ the viscosity,
$\rho$ the density, $n$ the unit normal vector to $\Sigma$ pointing to $\Omega_0$ and $\imath$ the unit
vector in the $x$-direction.

In the elastic regime the gap between the sphere and the plane is governed
by (see [10] and [6]):

$$h = h(x,y,p) = h_0 + \frac{x^2 + y^2}{2R} + \frac{2}{\pi E} \int_\Omega \frac{p(t,u)}{\sqrt{(x-t)^2 + (y-u)^2}} \, dt \, du \tag{4}$$

where $h_0$ is the minimum reference gap, $E$ is the Young equivalent modulus and
$R$ is the sphere radius. Equation (4) comes from Hertzian contact theory
for local contacts. The relation between pressure and viscosity is:

$$\nu(p) = \nu_0 e^{\alpha p} \tag{5}$$

where $\alpha$ and $\nu_0$ denote the piezoviscosity constant and the zero pressure viscosity,
respectively. Moreover, the boundary conditions are:

$$\theta = \theta_0 \quad \text{in } \Gamma_0 \tag{6}$$

$$p = 0 \quad \text{in } \Gamma. \tag{7}$$

The above conditions correspond to a drip feed device where the lubricant
is supplied from the boundary $\Gamma_0$. Finally, the hydrodynamic load generated
by the fluid pressure must balance the load $\omega$ imposed on the device, in a
direction normal to the plane (see Fig. 1(a)):

$$\omega = \int_\Omega p(x,y) \, dx \, dy \tag{8}$$

In this model, the parameter $h_0$ appearing in Equation (4) is an unknown of
the problem related to this condition. The numerical solution of the problem
consists of the approximation of $(p, \theta, h, h_0)$ verifying (1)-(8).

3  The Numerical Algorithm


In order to perform the numerical solution of (1)-(8) with real industrial data,
we first proceed to its adimensionalization in terms of the load, radius and
material data. This adimensionalization leads to a fixed domain $\Omega = (-4, 2) \times (-2, 2)$
by introducing the Hertzian contact radius and the maximum Hertzian
pressure,


$$b = \left(\frac{3 \omega R}{2E}\right)^{1/3}, \qquad P_h = \frac{3\omega}{2\pi b^2},$$

respectively, which only depend on the imposed load $\omega$, the Young modulus $E$
and the sphere radius $R$. Thus, the new dimensionless variables to be considered
are:


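$$P = \frac{p}{P_h}, \quad H = \frac{hR}{b^2}, \quad X = \frac{x}{b}, \quad Y = \frac{y}{b}, \quad \bar{\alpha} = \alpha P_h, \quad \bar{\nu} = \frac{\nu}{\nu_0},$$

$$\lambda = \frac{8\pi \nu_0 s R^2}{\omega b}, \quad \bar{\omega} = \frac{\omega}{P_h b^2} = \frac{2\pi}{3}.$$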
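
As a quick consistency check on the dimensionless load, and assuming the usual Hertzian maximum pressure $P_h = 3\omega/(2\pi b^2)$, the value $2\pi/3$ follows directly from the definition:

```latex
\bar{\omega} = \frac{\omega}{P_h b^2}
             = \omega \cdot \frac{2\pi b^2}{3\omega} \cdot \frac{1}{b^2}
             = \frac{2\pi}{3}
```

The dimensionless load is therefore a constant that no longer depends on the data of the particular device.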


After this adimensionalization, the substitution in (1)-(8) leads to equations
of the same type in the dimensionless variables. In [9] a new numerical algorithm
is proposed to solve this new set of equations. In this paper, high performance
computing techniques are applied to this algorithm, which is briefly described
hereafter.

The first idea is to compute the hydrodynamic load (Eqn. (8)) for different
gap parameters $h_0$ in order to establish a monotonic relation between $h_0$ and this
hydrodynamic load. Then, the numerical solution of the problem for each $h_0$ is
decomposed into the numerical solution of:


1.  The hydrodynamic problem: Computation of the pressure and saturation for
a given gap. A previous reduction to an isoviscous case is performed in order
to remove the nonlinearity introduced by (5). In this step, the characteristics
algorithm, finite elements and duality methods are the main tools.

2.  The  elastic  problem:  Computation  of  the  gap  for  a  given  pressure  by  means 
of  numerical  quadrature  formulae  to  approximate  the  expression  (4).  This 
computation  can  be  expressed  by  means  of  the  loop  shown  below: 

LOOP in n (Gap Loop)

$$h(p^{n+1}) = h_0 + \frac{x^2 + y^2}{2R} + \frac{2}{\pi E} \sum_{k \in \tau_h} \int_k \frac{p^{n+1}(t,u)}{\sqrt{(x-t)^2 + (y-u)^2}} \, dt \, du \tag{9}$$

In the previous formula the updating of the gap requires the sum of the
integrals over each triangle $k$ of the finite element mesh $\tau_h$.

The complex expression of the gap at each mesh point explains the high
computational cost of this step of the algorithm to obtain $h(p^{n+1})$.

An outer loop for the gap and pressure computations for each value of $h_0$
determines the value of $h_0$ which balances the imposed load (Eqn. (8)). This
design of the algorithm is based on monotonicity arguments and regula falsi
convergence. First, an interval $[h_a, h_b]$ which contains the final solution
$h_0$ is obtained. Then, this value is computed by using a regula falsi method.
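
The load-balancing step can be illustrated with a generic regula falsi iteration. This is a minimal sketch and not the authors' code: `toy_load` is a hypothetical, monotonically decreasing stand-in for the hydrodynamic load, which in the real algorithm requires solving the coupled pressure and gap problems for each trial value of h0.

```python
def regula_falsi(f, a, b, tol=1e-10, max_iter=200):
    """Find a root of f in [a, b] by false position; f(a) and f(b) must differ in sign."""
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0:
        raise ValueError("root is not bracketed by [a, b]")
    c = a
    for _ in range(max_iter):
        # Point where the secant through (a, fa) and (b, fb) crosses zero.
        c = (a * fb - b * fa) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol:
            break
        # Keep the sub-interval that still brackets the root.
        if fa * fc < 0:
            b, fb = c, fc
        else:
            a, fa = c, fc
    return c

def toy_load(h0):
    """Hypothetical load that decreases as the gap parameter grows."""
    return 10.0 / (1.0 + h0)

imposed_load = 4.0
h0 = regula_falsi(lambda h: toy_load(h) - imposed_load, 0.0, 5.0)
```

Monotonicity of the load with respect to h0 is what guarantees that a sign change on the initial interval brackets a single solution.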

In  order  to  clarify  the  structure  of  the  code,  we  present  a  flowchart  diagram 
of  the  numerical  algorithm  in  Fig.  2. 


Fig. 2. Flowchart diagram of the algorithm in terms of functional blocks. Next to the
identifier of each functional block appears, in inverted commas, the name of the
programme variable updated as a result of the computation. The parameters mmult, npat,
maxith and maxrf represent the maximum number of iterations for the multipliers,
characteristics, gap and load loops, respectively.

A numerical flux conservation test shows that better results correspond to
finer grids. Nevertheless, this mesh refinement greatly increases the computing
times and motivates the interest in using high performance computing.


4  Improving  Algorithm  Performance 

The main goal of our work is to increase the performance of the new
numerical algorithm [9] briefly explained above. To this end, we
have considered two different approaches: first, the use of vectorization
techniques and, second, the use of parallelization techniques.

The first part of our work consisted of modifying the original source code of
the programme in order to get the maximum possible performance from a vector
computer architecture. The target machine was the Fujitsu VP2400/10 available
at the CESGA laboratories ¹.

The second step of our work consisted of developing a parallel version of the
original sequential source code by using the PVM message-passing libraries. In
order to check the performance of the parallelization process, we have executed
our parallel programme on a cluster of workstations based on a SPARC
microprocessor architecture at 85 MHz and the Solaris operating system, interconnected
via an Ethernet local area network.


4.1  Vectorization  Techniques 

The  first  step  that  must  be  done  when  trying  to  vectorize  a  sequential  programme 
is  to  analyze  its  source  code  and  identify  the  most  costly  parts  of  the  algorithm 
in  terms  of  execution  time.  We  will  focus  most  of  our  efforts  on  these  parts.  We 
used  the  ANALYZER  tool  [7]  in  order  to  carry  out  this  task. 

The analysis of the programme's source code led us to identify nine
functional blocks according to their relationship with the mathematical methods used
in the numerical algorithm, namely, finite elements, characteristics and duality
methods. These blocks, shown in Fig. 2, are the following: finite element matrix,
initial second member of the system, updating of the second member due to
characteristics, final second member, computation of the multiplier, convergence
test on multipliers, stopping test in pressure, updating of the gap and stopping
test in the gap.

We have studied the distribution of the total execution time of the
programme, and we have come to the conclusion that 90% of that time is
taken by the blocks included in the multipliers loop (final second member and
computation of the multiplier) and by the block that updates the gap. For this
reason, in this paper we explain the vectorization process corresponding to the
three blocks that we have just mentioned. More detailed information about the
whole vectorization process can be found in [1].

¹ CESGA (Centro de Supercomputacion de Galicia): Supercomputing Center of
Galicia, located at Santiago de Compostela (Spain).

Final Second Member. The first block inside the multipliers loop computes
the final value of the second member of the system of linear equations that is solved
in the current iteration of the loop.

This block is divided in two parts. The first one is only partially vectorizable
by the compiler because of the presence of a recursive reference. Table 1 shows
the execution times obtained for this part before and after the vectorization
process according to the format h:m:s.μs, where h, m, s and μs represent the
number of hours, minutes, seconds and microseconds, respectively.


Table 1. Execution times corresponding to the routine modsm().

         T_Orig     T_AV       iprv_AV   T_MV       iprv_MV
mesh1    0.001214   0.000526   56.67%    0.000325   73.23%
mesh3    n/a        0.005188   65.69%    0.004595   69.61%
mesh4    0.060288   0.020446   n/a       0.018160   69.88%
mesh5    n/a        0.040887   n/a       0.036354   69.77%

The first column indicates which mesh has been used for the discretization
of the domain of the problem, where mesh1, mesh3, mesh4 and mesh5 consist
of 768, 9600, 38400 and 76800 finite elements, respectively. The column labeled
T_Orig shows the time measured when executing the programme after a
scalar compilation. The columns labeled T_AV and T_MV show, respectively,
the times after setting the automatic vectorization compiler option and after
applying this same option to the modified source code we have implemented in
order to optimize the vectorization process. The columns iprv_AV and iprv_MV
give the improvement with regard to the original scalar execution time. The
percentages shown have been calculated by means of the following expressions:

iprv_AV = 100 * (1 - (T_AV / T_Orig))    (10)

iprv_MV = 100 * (1 - (T_MV / T_Orig))    (11)
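
As a worked check of expressions (10) and (11), the short sketch below recomputes the improvement percentages for the mesh1 row of Table 1. The function name `improvement` is chosen here for illustration only.

```python
def improvement(t_new, t_orig):
    """Percentage reduction of the execution time relative to the scalar run."""
    return 100.0 * (1.0 - t_new / t_orig)

# mesh1 row of Table 1: scalar, automatically vectorized and manually tuned times.
t_orig, t_av, t_mv = 0.001214, 0.000526, 0.000325

print(f"iprv_AV = {improvement(t_av, t_orig):.2f}%")  # prints iprv_AV = 56.67%
print(f"iprv_MV = {improvement(t_mv, t_orig):.2f}%")  # prints iprv_MV = 73.23%
```

A negative value, as appears in some rows of the later tables, simply means that the vectorized run was slower than the scalar one.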

The second part of this block has been vectorized by making use, on the one
hand, of the loop coalescing technique [11] and, on the other, of the in-lining
transformation technique. The latter has been applied manually so as to be
able to optimize the source code for the particular case we are interested in. This
choice introduces a loss of generality in the vector version of the programme with
regard to its original source code. We consider this acceptable because our
goal was to reduce the execution time as much as possible.

Table 2 shows the execution times corresponding to this fragment of code. It
is important to highlight the large improvement we have obtained with respect
to simple automatic vectorization.


Table 2. Execution times corresponding to the routine bglfsm().

         T_Orig     T_AV       iprv_AV   T_MV       iprv_MV
mesh1    0.001231   0.001255   -1.95%    0.000508   58.73%
mesh3    0.015383   0.015339   0.29%     0.005918   61.53%
mesh4    0.059497   0.057986   2.54%     0.023546   60.42%
mesh5    0.118632   0.115769   2.41%     0.047059   60.33%


Computation of the Multiplier. The third block inside the multipliers loop
mainly copes with the updating of the multiplier introduced by the duality type
method proposed in [3]. Besides, it performs the computations needed to
implement the convergence test on multipliers.

In the first case, we were faced with a loop that was
not automatically vectorizable by the compiler because of the presence of, on the
one hand, a call to an external function and, on the other, a recursive reference
on the vector whose value is computed, namely, alfa.

The first of these problems was solved by manually applying the in-lining
technique. Again the resulting source code was optimized for the particular case
we are interested in, with the corresponding loss of generality.

The recursive reference present in the source code is not automatically vectorizable
because when computing the i-th element of the vector alfa, the (i - k)-th element
is referenced, where k is a positive constant with a different value for each mesh.
The use of the compiler directive *VOCL LOOP,NOVREC(alfa) (see [8]), which
explicitly tells the compiler to vectorize the recursive reference, has allowed us
to overcome the problem. We have checked that the numerical results remain
correct, while the percentage of vectorization of our code improves.
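
The following sketch suggests why a recurrence with dependence distance k can still be processed in vector-like chunks: each chunk of at most k elements reads only values produced by the previous chunk. The update formula `0.5 * alfa[i - k] + b[i]` is purely hypothetical, since the text does not give the actual expression, and the sketch makes no claim about how the Fujitsu compiler implements the directive.

```python
def sequential(alfa, b, k):
    """Reference loop: element i depends on element i - k."""
    out = list(alfa)
    for i in range(k, len(out)):
        out[i] = 0.5 * out[i - k] + b[i]
    return out

def chunked(alfa, b, k):
    """Same recurrence, processed in independent chunks of at most k elements."""
    out = list(alfa)
    n = len(out)
    for start in range(k, n, k):
        stop = min(start + k, n)
        # All reads fall in [start - k, stop - k), i.e. before this chunk,
        # so the whole chunk could be issued as a single vector operation.
        out[start:stop] = [0.5 * out[i - k] + b[i] for i in range(start, stop)]
    return out

values = [1.0] * 10
offsets = [float(i) for i in range(10)]
assert sequential(values, offsets, 3) == chunked(values, offsets, 3)
```

The equivalence holds only because the dependence distance is at least the chunk length; with a shorter distance the chunked version would read stale values.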

Table  3  shows  the  execution  times,  expressed  in  microseconds,  corresponding 
to  the  process  described  in  the  previous  paragraphs. 


Table 3. Execution times corresponding to the updating of the multiplier.

         T_Orig     T_AV       iprv_AV   T_MV       iprv_MV
mesh1    0.001115   0.001113   0.18%     n/a        96.14%
mesh3    0.012785   0.012903   -0.92%    0.000162   98.73%
mesh4    0.050319   0.049284   2.06%     0.000489   99.03%
mesh5    0.100197   0.098290   1.90%     0.000930   99.07%

As can be seen, improvement percentages are close to 100%, far above those
obtained by enabling automatic vectorization without providing any extra
information to the compiler. If we take into account that this functional block is
placed in the body of the innermost loop of the algorithm (the multipliers loop),
which is the section of the programme that is executed the greatest number of times,
we can conclude that the reduction of the execution time of the whole programme
will be very significant.

Updating of the Gap. The third and last block whose vectorization process is
presented in this paper copes with the computation of the gap between the two
surfaces in contact in the ball-bearing geometry of Fig. 1(a). The source code
corresponding to this block was fully vectorizable by the compiler. Nevertheless,
we have made some changes by applying source code optimization techniques
and by using scalar variables in reduction operations.

Table  4  shows  execution  times  corresponding  to  the  updating  of  the  gap. 


Table 4. Execution times corresponding to the updating of the gap.

         T_Orig           T_AV             iprv_AV   T_MV   iprv_MV
mesh1    0:00:01.351421   0:00:00.060134   95.55%    n/a    96.15%
mesh3    0:03:28.791518   0:00:08.546158   95.91%    n/a    96.20%
mesh4    0:51:53.848528   0:02:17.901371   95.57%    n/a    n/a
mesh5    3:26:54.230843   0:09:09.402314   95.57%    n/a    n/a

As can be seen, the execution time improvement percentage is slightly higher
than 96% independently of the size of the mesh.

Although this part of the algorithm is executed a relatively low number of
times, it is by far the most costly of the programme (approximately 14% of the
total scalar execution time). This is due to the fact that for each node of the mesh
(the number of nodes is 19521 in mesh4, for example) it is necessary to compute
an integral over the whole domain of the problem. That integral is decomposed into
a sum of integrals over each finite element (the number of elements is 38400 in
mesh4), which are computed by numerical integration.

In  Section  5  we  present  the  times  corresponding  to  the  execution  of  the  whole 
programme. 


4.2  Parallelization  Techniques 

The parallel version of the algorithm is currently being refined. In this paper
we focus our attention on the block corresponding to the updating of the
gap between the two surfaces in lubricated contact (see the corresponding part of
Subsection 4.1).

For each node of the mesh, the updating of the gap is done according to (9).
If we analyze the source code corresponding to this computation, we can see that
we are faced with two nested loops, where the outermost makes one iteration for
each node of the mesh and the innermost makes one iteration for each finite
element.

Each  iteration  of  the  outer  loop  calculates  the  gap  corresponding  to  one  node 
of  the  mesh.  This  computation  is  cross-iteration  independent.  This  characteristic 
makes  the  efficiency  of  the  process  of  parallelization  to  be  independent  of  the 
data  distribution  among  the  workstations,  which  lead  us  to  choose  one  simple 
data  distribution  such  as  the  standard  consecutive  distribution. 
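
The loop structure and the consecutive distribution described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy per-element term (a stand-in for the numerical integral over one finite element) are assumptions made for illustration only.

```python
def consecutive_blocks(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, nearly equal blocks."""
    base, extra = divmod(n_items, n_workers)
    blocks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)  # first `extra` workers get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks


def update_gap(node_block, n_elements, element_term):
    """Outer loop over the nodes of one block, inner loop over all elements."""
    gap = {}
    for node in node_block:
        total = 0.0
        for elem in range(n_elements):
            total += element_term(node, elem)
        gap[node] = total
    return gap


def toy_element_term(node, elem):
    # Hypothetical stand-in for the numerical integral over one finite element.
    return 1.0 / (1 + abs(node - elem))


blocks = consecutive_blocks(10, 4)
# Each block could be handled by a different workstation; here they run in turn.
parallel_gap = {}
for block in blocks:
    parallel_gap.update(update_gap(block, 6, toy_element_term))
serial_gap = update_gap(range(10), 6, toy_element_term)
```

Because no outer iteration reads a value written by another, the block-wise result coincides with the serial one whatever the block boundaries are.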




Table 5 shows the execution times obtained for this functional block after
the parallelization process. Fig. 3 represents, for each mesh, the speedups
corresponding to the execution times shown in Table 5.


Table 5. Execution times corresponding to the updating of the gap.

N     mesh1    mesh3      mesh4         mesh5
1     3.961    9:29.688   2:29:35.406   9:52:36.750
2     1.872    4:21.219   1:08:52.500   -
4     0.953    2:19.834   0:34:35.219   -
8     0.598    2:27.909   0:17:24.781   1:08:56.719
12    0.542    1:18.613   0:11:40.469   0:46:11.344
16    0.577    1:00.074   0:08:54.438   0:34:57.156

The speedup is usually defined according to the following formula:


speedup = T_1 / T_N    (12)


where T_N represents the time when using N processors and T_1 the time obtained
when executing the parallel programme on only one processor. In almost all
cases a superlinear speedup is obtained, that is to say, the value of the speedup is
greater than the number of processors used. This effect may be due to the fact
that the vector of gaps, with one entry per node of the mesh, is distributed
among the set of workstations. Each workstation therefore handles a smaller
vector, which decreases the number of cache misses.
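
As a worked illustration of formula (12), the following Python sketch computes the speedups for the mesh4 column of Table 5. The helper name `to_seconds` and the variable names are assumptions introduced here; the timings are copied from the table.

```python
def to_seconds(stamp):
    """Convert 's', 'm:s' or 'h:m:s' (with fractional seconds) into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = 60.0 * seconds + float(part)
    return seconds


# Execution times for mesh4, taken from Table 5 (N -> time).
mesh4_times = {
    1: "2:29:35.406",
    2: "1:08:52.500",
    4: "0:34:35.219",
    8: "0:17:24.781",
    12: "0:11:40.469",
    16: "0:08:54.438",
}

t1 = to_seconds(mesh4_times[1])
speedups = {n: t1 / to_seconds(t) for n, t in mesh4_times.items()}
# Processor counts for which the speedup exceeds N (superlinear speedup).
superlinear = [n for n, s in speedups.items() if s > n]
```

For this column every parallel run (N = 2 to 16) is superlinear, which is consistent with the cache explanation given above.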


Fig.  3.  Speedups  corresponding  to  data  shown  in  Table  5. 




Note also that the performance of the programme decreases when small
meshes (mesh1 and mesh3) are run on a large number of workstations. This effect is
due to the communication overhead.


5  Numerical  Results 

The results obtained from the vectorized code were validated by comparing them
with those computed by the scalar code.

Table 6 reports several execution time measures that illustrate the good performance
achieved after applying code and compiler optimization techniques, compared
with that obtained by the vectorizer without any extra information about the code. The
table also shows the scalar execution time. In these cases the coarsest grid, namely
mesh1 (768 triangular finite elements), is used for the spatial discretization.


Table 6. Execution times for mesh1 on the VP2400/10.


1 

T-Orig 

T^V 

Bigg 

TCPU 

%  Vect.Exec. 

%  Vectorization 

6 

11:24 

DfSP 

24.82% 

53.51% 

1 

n 

25.15% 

58.75% 

9 

15:14 

H  ^ 

1 R 

45.84% 

3 

07:32 

24.42% 

49.34% 

1 

TJ4V 

BlMli 

%  Vectorization 

6 

11:24 

02:31 

94.59% 

/ 

05:03 

01:00 

78.95% 

94.72% 

9 

15:14 

03:50 

80.14% 

94.76% 

i 

01:48 

94.25% 

The first column presents the different input data sets, labeled as 6, I, g
and j, that have been employed for the tests. The column T.Orig shows the
scalar execution time of the corresponding tests, and the group of three columns
labeled T.AV shows the times corresponding to the execution of the automatically
vectorized code. The column TCPU represents the total execution time. The
column TVU refers to the programme execution time on the vector units of the
VP2400/10. The column %Vect.Exec. (vector execution percentage) measures
the percentage of the execution time that is spent on the vector units
of the VP2400/10, taking as reference the execution time of the automatically
vectorized version. Its value is obtained as follows:

%Vect.Exec.  =  (TVU /TCPU)  *  100  (13) 

Finally, the column %Vectorization gives an idea of how well suited the
code is to execution on a vector machine. Its value is computed according
to the formula


%Vectorization = (T.Orig.Vect / T.Orig) * 100  (14)

where T.Orig.Vect is the execution time in scalar mode corresponding to the
vectorizable part of the original source code, that is to say,

T.Orig.Vect = T.Orig - (TCPU - TVU)  (15)
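
Formulas (13) to (15) can be checked with a short Python sketch. The function names and the three timings below are hypothetical values chosen for illustration; they are not taken from Table 6.

```python
def vect_exec_pct(tvu, tcpu):
    """Formula (13): share of the run time spent on the vector units."""
    return 100.0 * tvu / tcpu


def vectorization_pct(t_orig, tvu, tcpu):
    """Formulas (14)-(15): share of the scalar time that is vectorizable."""
    t_orig_vect = t_orig - (tcpu - tvu)  # (15): scalar time minus the non-vector part
    return 100.0 * t_orig_vect / t_orig  # (14)


# Hypothetical timings in seconds (assumed for this example only).
t_orig, tcpu, tvu = 300.0, 80.0, 60.0
exec_pct = vect_exec_pct(tvu, tcpu)
vec_pct = vectorization_pct(t_orig, tvu, tcpu)
```

With these assumed timings, 75% of the vectorized run is spent on the vector units, while about 93.3% of the original scalar time belongs to vectorizable code.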

The group of columns labeled T.MV shows the times corresponding to
the execution of the optimized version of the code that we have implemented.

Note that we have been able to increase the %Vectorization from about 50% to
about 94%. This greatly reduces the computational
cost of the algorithm. The execution times of the vectorized version for mesh4
and mesh5 are presented in Table 7.


Table 7. Execution times for mesh4 and mesh5 on the VP2400/10.


1 

T.MV 

J 

1 

meshA 

meshb  \ 

TVU 

TCPU 

%  Vect.Exec. 

TVU 

TCPU 

%  Vect.Exec. 

b 

1 

08:44:36 

17:14:28 

11:11:55 

22:08:59 

72.82% 

77.28% 

77:36:37 

77.93% 

9 

i 

20:25:24 

33:45:50 

26:17:54 

43:38:38 

76.94% 

76.75% 

68:54:51 

88:37:47 

77.28% 

The aim of the work was not only to reduce computing time but also to test in
practice the convergence of the finite element space discretization. For this
purpose, the finer grids mesh4 (38400 triangular finite elements) and mesh5
(76800 triangular finite elements) were considered. For both new meshes, the huge
partial computing times observed prevented us from executing the scalar code. Figs. 4
and 5 present the pressure and gap approximation profiles for mesh4 and
mesh5, respectively, at an appropriate scale. In both figures, the relevant
parameter is the imposed load, which is taken to be w = 3.

The plots in Figs. 4 and 5 represent the lower boundary of the domain
of the problem (see Fig. 1) on the x axis. There are two sets of curves. The
mountain-like one represents the pressure of the lubricant in the contact zone.
The parabola-like one represents the gap between the two surfaces in contact.

Next, in order to verify the expected qualitative behaviour already observed
in previous tests with mesh3 when increasing the load, we considered mesh4
for the datum w = 5. Fig. 6 shows the pressure and gap profiles computed for
w = 5.

6  Conclusions 

From the numerical viewpoint, the practical convergence validation of a new
algorithm has been performed with the help of the vectorization techniques applied
to the original scalar code. Moreover, the accuracy of the approximation has been
increased with the introduction of finer grids (mesh4 and mesh5) that confirm
the expected results for the numerical method implemented by the algorithm.

Fig. 4. Pressure and gap profiles for w = 3 and θ0 = 0.3 with mesh4.


Fig. 5. Pressure and gap profiles for w = 3 and θ0 = 0.3 with mesh5.


Fig. 6. Pressure and gap profiles for w = 5 and θ0 = 0.3 with mesh4.

We have developed a vector version in which about 94% of the original
source code has been vectorized. In exchange, our vector version has
partially lost the generality of the original programme and has been implemented
in a manner that depends on the architecture of the machine.

On the other hand, a first approach to a parallel version of the algorithm is
being developed. As an example, we have presented a parallel implementation of
one functional block of the algorithm, for which a noticeable superlinear speedup
is obtained in some cases.

Currently, we are refining the parallel version so as to reduce the execution
time corresponding to the multipliers loop and so as to execute a final version
on a multiprocessor architecture like the Cray T3E.


Abstract. The simulation of a complex mass transport process, namely the
diffusion-affected flow of a high-viscosity fluid in a transient press, has been
performed. Massively parallel algorithms were used in the crucial phases of the
simulation (matrix formulation, linear solver). An advanced double-feedback
task management system has increased the efficiency of the
distributed implementation of these algorithms. The first
feedback works outside the application and couples the network diagnostic
system with the task allocation and migration mechanisms. The second one
deeply affects the numerical algorithm, dynamically adapting the mesh
partitioning in each iteration. A stochastic performance forecast is used in the
task allocation policies.


Motivation 

Today's dynamic simulations of transport processes are powerful and widely used,
e.g. by the space industry and in advanced control systems of modern plants.
Yet we still make little use of dynamic modeling of complex thermochemical
processes, and there is a growing demand for more advanced modeling
of real, practical systems. In contrast to the extensive research on the modeling of
injection molding [1], the modeling of three-dimensional flow fields in transient
presses is relatively little known [2]. Such units are widely used, e.g. in the ceramic
and carbon industries, to form various elements that later undergo thermal
and/or ageing treatments. The medium is usually a multi-component, multi-phase
homogeneous slurry that shows very complex properties. Modeling the flow in
such a process poses several major challenges, since the flow is inherently
transient, includes a free surface, and the fluid moves through an irregular extrusion die.

The above numerical simulation requires huge CPU and RAM resources, so only
massively parallel algorithms can be used. Because of limited access to fast
multiprocessor installations, we decided to develop a distributed implementation
dedicated to local networks of workstations and medium-size LAN servers.
Methods that can dynamically adapt a distributed application to the current state of
the network environment significantly decrease its execution time.


Mathematical  statement  and  physical  description 


This work presents the simulation of a complex mass transport process, namely the
diffusion-affected flow of a high-viscosity fluid in a transient press. The
mathematical model of the process allows us to examine the effects of nonuniform
heating of the press wall (temperature-dependent viscosity), the geometry of the
extrusion die and the influence of the lubricant on the fluid-wall friction. An obvious
simplification is the assumption that the slurry is a Newtonian compressible fluid.

We will study the behavior of the fluid contained in a time-dependent domain
$\Omega(t)$ during the period $t \in [0, t_{\max}]$. Its boundary $\partial\Omega(t)$ is a disjoint sum
of three regular surfaces: the known rigid boundary, the known free boundary and
the unknown free boundary, respectively.

We are looking for the following unknowns: $\upsilon: \bigcup_t \{t\} \times \Omega(t) \to \mathbb{R}^3$, the velocity of the
fluid, and $p, \rho, T, g: \bigcup_t \{t\} \times \Omega(t) \to \mathbb{R}$, which represent the fluid pressure, density,
temperature and lubricant density, respectively. They should satisfy three basic
conservation laws:


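$$\frac{\partial \rho}{\partial t} + \mathrm{div}(\rho \upsilon) = 0$$
mass conservation law,

$$\rho \left( \frac{\partial \upsilon}{\partial t} + \upsilon \nabla \upsilon \right) = \mathrm{Div}\, \Im + \rho b$$
momentum conservation law (the Navier-Stokes system),

$$\frac{\partial E}{\partial t} + \mathrm{div}(E \upsilon) = \mathrm{div}(\Im \upsilon + \Theta\, \mathrm{grad}\, T)$$
law of total energy conservation,

$$\frac{\partial g}{\partial t} + \mathrm{div}(g \upsilon - D\, \mathrm{grad}\, g) = 0$$
conservation law of the lubricant mass,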

where $\nu = \nu(T(t,x))$, $\Theta = \Theta(T(t,x))$ and $D = D(T(t,x))$ are the viscosity and thermal
diffusivity of the fluid and the diffusion coefficient of the lubricant, and $b = b(t,x)$ denotes
external mass forces (e.g. gravity), assumed to be well-known functions of time
$t$ and position $x \in \Omega(t)$. Moreover, $E$ is the energy density of the fluid, and

$$p = p(\rho, T) = C_1 (\rho - \rho')^{\gamma} + C_2 T, \qquad \gamma > 1, \quad C_1, C_2, \rho' > 0$$

will be utilized as the equation of state.

We assume the following initial conditions: $\upsilon(0) = \upsilon_0$, $p(0) = p_0$, $\rho(0) = \rho_0$,
$T(0) = T_0$, $g(0) = g_0$ on $\Omega(0)$; and for each $t \in [0, t_{\max}]$ the following boundary
conditions:

i)  T(t)  =  T^(t)  on  dC2(t), 

ii)  -1^(0  =  0  on 

iii)  ('u(f)|n^)  vanish  on  d^,^Q(t)  in  coordinate  system  which  is  stiffly  attached 
to  the  press  wall, 

iv)  3(/)n,  =  -p^,  (On,  on  d}.„Q(t) , 

v)  3(0n,  = -(3(0n,|n,)n,  is  perpendicular  to  -ufO-fvCOlnOn,  on  9*^f2(0. 
where $n_t$ denotes the unit outward normal to the surface $\partial\Omega(t)$. A detailed
formulation of the above problem can be found in [3].


Fig. 1 Schematic view of: a) the evolution of fluid mass in the transient press,
b) the basic cell of triangulation in $\Omega(0)$.

Discrete  FE/FD  approach 

We  consider  an  arbitrary  domain  Q  c  with  regular  boundary  such  that 
Sfcj  c  n  and  one-parameter  family  of  planes  which  are  orthogonal  to  the 

axis  such  that  M,,:=  v\':{x~ne^\e^)  =  uj.  Moreover  we  assume  that 

V/7e9t\V^GQ  r^A4  +  7;9a  =  'K^ 
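where $T_\xi \partial\Omega$ is the plane tangent to $\partial\Omega$ at the point $\xi$. We assume that we know two
functions $h_{\min}, h_{\max}: [0, t_{\max}] \to \mathbb{R}$ such that $h_{\min}(t) < h_{\max}(t)$ for each $t \in [0, t_{\max}]$
(see Fig. 1.a). Let us define

$$\Omega(t) := \Omega \cap \{ x \in \mathbb{R}^3,\ x = (x_1, x_2, x_3) : h_{\min}(t) < x_3 < h_{\max}(t) \}, \quad t \in [0, t_{\max}] \qquad (1)$$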

and  distinguish  the  sequence  of  points  . 

Using the algorithm of two-dimensional (flat) Delaunay triangulation [4, 5] we
attempt to triangulate the first layer. Next we copy this triangulation to the
remaining layers, $i = 1, \ldots, N$, solving the Dirichlet problem for the
two-dimensional Laplace equation. Then, using nodal points from two consecutive layers,
we build the basic cells of the triangulation (see Fig. 1.b). Each of these cells is split into six
simplexes of the cubic triangulation in a canonical way. The initial mesh is transformed
into the mesh covering $\Omega(t)$ for an arbitrary time instant $t > 0$. Mesh nodes take
new positions, but the triangulation topology remains unchanged. Let us denote by $\mathcal{P}$ the
set of node labels associated with the initial mesh. Thanks to the constant topology,
the labels remain valid for $t > 0$.

To solve the presented differential problem we use the Faedo-Galerkin method
with respect to the spatial variables (see e.g. Thomée [6] and Lions, Magenes [7]).
We will use the family of finite-dimensional spaces spanned by
first-degree Lagrange splines $\varphi_P: \bigcup_t \{t\} \times \Omega(t) \to \mathbb{R}$, which are affine on every simplex
and take the value 1 at $x_P \in \Omega(t)$, $P \in \mathcal{P}$ (the current position of the $P$-th node) and 0 at
$x_Q \in \Omega(t)$, $Q \in \mathcal{P}$, $P \neq Q$, for each $t \in [0, t_{\max}]$. We look for an approximate solution

in  a  form: 


(2) 


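where $\lambda_{i,P}(t) \in \mathbb{R}$ represent approximate nodal values of $\{\upsilon_i(t)\}$, $i = 1, 2, 3$, $p(t)$,
$\rho(t)$, $T(t)$, $g(t)$. After spatial integration we may pass to the initial value problem for the
system of ODEs:

$$L(t)\, \dot{\Lambda}(t) = F(t, \Lambda(t)), \quad \Lambda(0) = \Lambda_0, \quad \Lambda(t) := \{\lambda_{i,P}(t)\}_{i,\, P \in \mathcal{P}} \qquad (3)$$

Next we can solve the above system using one of the multi-step methods, such as the
third-order Adams-Bashforth method [8]:

$$L_i (\Lambda_{i+1} - \Lambda_i) = \frac{\Delta t}{12} \left( 23 F(t_i, \Lambda_i) - 16 F(t_{i-1}, \Lambda_{i-1}) + 5 F(t_{i-2}, \Lambda_{i-2}) \right) \qquad (4)$$

$$\Lambda_i = \Lambda(t_i), \quad i = 0, 1, 2, \ldots, \quad 0 < t_i < t_j \ \text{for}\ i < j$$

where $\Delta t$ is the time step. Finally we consecutively solve the linear algebraic system

$$L_i \Lambda_{i+1} = R_i, \quad i = 1, 2, \ldots \qquad (5)$$

where

$$L_i := L(t_i), \quad R_i = \frac{\Delta t}{12} \left( 23 F(t_i, \Lambda_i) - 16 F(t_{i-1}, \Lambda_{i-1}) + 5 F(t_{i-2}, \Lambda_{i-2}) \right) + L(t_i) \Lambda_i .$$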
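
The time-stepping scheme above can be sketched in Python for a single scalar unknown. This is only an illustration under stated assumptions: a constant step `dt`, a scalar `L(t)` so that the linear solve reduces to a division, and three starting values supplied by the caller. The function name `ab3_solve` is hypothetical.

```python
import math


def ab3_solve(L, F, t0, dt, n_steps, start_values):
    """Integrate L(t) * y'(t) = F(t, y) with the third-order Adams-Bashforth rule.

    start_values holds y at t0, t0 + dt and t0 + 2*dt (assumed known).
    Returns the lists of times and values up to t0 + n_steps*dt.
    """
    ts = [t0 + k * dt for k in range(3)]
    ys = list(start_values)
    for i in range(2, n_steps):
        # Right-hand side R_i of the linear system L_i * y_{i+1} = R_i.
        r_i = dt / 12.0 * (23 * F(ts[i], ys[i])
                           - 16 * F(ts[i - 1], ys[i - 1])
                           + 5 * F(ts[i - 2], ys[i - 2])) + L(ts[i]) * ys[i]
        ys.append(r_i / L(ts[i]))  # scalar stand-in for the linear solve
        ts.append(ts[i] + dt)
    return ts, ys


# Toy problem (assumed): 2 * y' = -2 * y, i.e. y' = -y, with exact solution exp(-t).
dt = 0.01
start = [math.exp(-k * dt) for k in range(3)]
ts, ys = ab3_solve(lambda t: 2.0, lambda t, y: -2.0 * y, 0.0, dt, 100, start)
```

In the full problem the division is replaced by the solution of the sparse linear system with matrix $L_i$, which is the expensive step addressed in the rest of the paper.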
The overall scheme of distributed computation

Due to the large size of the spatial mesh and the nonstationary character of the
solution $\Lambda(t)$ as well as of its domain $\Omega(t)$, the problem needs effective and
flexible solution methods. The first task to be performed after the input data are
entered is the generation of the initial mesh in the initial domain $\Omega(0)$ (see Fig. 2, Stage I).
Thanks to the special mesh structure, which traces the deformation of $\Omega(t)$, the initial
mesh topology is valid during the whole computation. Only the new node positions
have to be determined at each $t$. They can be computed sequentially, due to the low
computational complexity.


Fig. 2 The general management scheme (management activity levels: network
performance diagnostic & forecast system, task allocation level, task migration level)


The main operations which have to be performed at each time step $t_i$, $i = 1, 2, \ldots$, are
the calculation of the matrix $L_i$ and of the right-hand side vector $R_i$, as well as
the solution of linear system (5), which represents the finite difference scheme to be
used. The first task has quadratic and the second one cubic serial computational
complexity with respect to the number of degrees of freedom (d.o.f.).

The way to decrease the computational time is to parallelize the above tasks
at each time step $t_i$, $i = 1, 2, \ldots$. This can be achieved if the mesh spanned on $\Omega(t_i)$ can
be partitioned into submeshes covering connected, disjoint subdomains
$\Omega^s(t_i)$, $s = 1, \ldots, S$. Another usual need is to keep the number of degrees of freedom lying on
the subdomain interfaces $\Omega^r(t_i) \cap \Omega^s(t_i)$, $r \neq s$, at a sufficiently low level. The regular,
constant mesh topology allows us to split $\Omega(t_i)$ into subdomains composed of an
arbitrary number of element "layers", perpendicular to the press axis. The main
advantage of this method is that the number of interface d.o.f. grows only linearly with the
number of subdomains and does not depend on the subdomain sizes.

The matrices $L_i$ and vectors $R_i$ can be computed in parallel in each subdomain, according to
the SBS strategy, and then system (5) can be transformed to the Schur complement
form (cf. Papadrakakis [9]). Finally, the Schur complement system can be solved using the
parallel GMRES algorithm with distributed RAM utilization (see Golub [10]).
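
The layer-based splitting can be pictured with a small Python sketch. It assumes, purely for illustration, that every interface between two consecutive subdomains is one plane of nodes carrying the same number of d.o.f.; the function names are hypothetical.

```python
def layer_subdomains(layers_per_subdomain):
    """Return (first_layer, last_layer) for each subdomain of consecutive layers."""
    bounds, first = [], 0
    for count in layers_per_subdomain:
        bounds.append((first, first + count - 1))
        first += count
    return bounds


def interface_dof(n_subdomains, dof_per_interface):
    """Interface d.o.f. under the one-plane-per-interface assumption."""
    return (n_subdomains - 1) * dof_per_interface


bounds = layer_subdomains([3, 5, 2])       # three subdomains of unequal size
n_interface = interface_dof(len(bounds), 100)
```

Under this assumption the interface size depends only on the number of subdomains, not on how many layers each one contains, which is the property stated above.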

The parallel computation described above is implemented on a distributed
MIMD architecture (e.g. a network of workstations) as a set of
intercommunicating processes.

The complex structure of the computation forces us to use multilevel management, which
aims at maximizing speedup and at economical utilization of computer resources (CPU,
RAM, ...) with respect to dynamic load changes of the computer nodes. Management
activities can be applied on three distinct levels:

1. Network diagnostic & forecast level, at which we decide about the number and
size of subdomains $\Omega^s(t_i)$, which will be valid during at least one computational
step $t_i$.

2. On-line task allocation level, where a task set can be assigned to a processor unit
(workstation) using various policies based on a current load measurement.
These policies are used for the initial allocation of matrix computation tasks
(see stage IV in Fig. 2) and GMRES tasks (stage V in Fig. 2) at each GMRES
iteration step.

3. Task migration level, where we can rearrange the initial task allocation performed at
the on-line task allocation level before the application reaches the nearest
synchronization point.


Dynamic  task  allocation  strategies 


Task  migration  and  simple  allocation  policies 

Dynamic task allocation is performed after each synchronization event, if the
forthcoming tasks can be processed in parallel. Three strategies can be proposed:

• Tasks are allocated in a cyclic (round-robin) manner. Each machine gets
approximately the same number of tasks.

• The serial computational complexity of the group of tasks sent to a
particular machine is proportional to its static benchmarked (nominal) speed.
Such a solution is suitable for networks which are only slightly loaded and without
a significant load disproportion between different machines.

• The serial computational complexity of the group of tasks assigned to a
particular machine is proportional to its power coefficient (see the next section for
a more precise description), which is either currently measured or determined on the basis
of the Markov load model described later.
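
The strategies listed above can be sketched in Python as follows. The sketch assumes that all tasks have equal complexity, so that "proportional complexity" reduces to a proportional number of tasks; the function names and the largest-remainder rounding are choices made here, not taken from the paper.

```python
def round_robin(n_tasks, n_machines):
    """Cyclic allocation: task k goes to machine k mod n_machines."""
    return [list(range(m, n_tasks, n_machines)) for m in range(n_machines)]


def proportional_counts(n_tasks, weights):
    """Number of tasks per machine, proportional to weights (speed or power).

    Fractional shares are rounded with the largest-remainder rule so that
    the counts always add up to n_tasks.
    """
    total = sum(weights)
    shares = [n_tasks * w / total for w in weights]
    counts = [int(s) for s in shares]
    leftover = n_tasks - sum(counts)
    by_remainder = sorted(range(len(weights)),
                          key=lambda m: shares[m] - counts[m], reverse=True)
    for m in by_remainder[:leftover]:
        counts[m] += 1
    return counts


cyclic_sizes = [len(group) for group in round_robin(150, 4)]
static_counts = proportional_counts(150, [2.0, 1.0, 0.5, 1.5])  # assumed nominal speeds
```

The same `proportional_counts` helper serves the second and third strategies; only the source of the weights differs (benchmarked speed versus measured or forecast power coefficient).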

Task migration can change the location of a task before the nearest synchronization event,
even during task execution. Only a simple migration criterion was used: when
machine X has finished its work (has become idle) and there are at least two
processes of our application on machine Y, then one of them is migrated to
X. This policy is especially useful for the on-line correction of a coarse-grained initial
task allocation on machines with rapid and unpredictable load variability, so it
was mainly used in stage V (see Fig. 2).
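
A minimal Python sketch of this migration criterion is given below. The text does not say which donor machine Y is chosen when several qualify; picking the one with the most pending processes is an assumption made here, as are the function and variable names.

```python
def migrate_once(queues):
    """Apply the migration criterion once.

    queues maps a machine name to the list of its pending processes.
    Returns (process, donor, receiver), or None if no migration applies.
    """
    idle = [m for m, q in queues.items() if not q]
    donors = [m for m, q in queues.items() if len(q) >= 2]
    if not idle or not donors:
        return None
    receiver = idle[0]
    # Assumption: take the process from the most heavily loaded donor.
    donor = max(donors, key=lambda m: len(queues[m]))
    process = queues[donor].pop()
    queues[receiver].append(process)
    return process, donor, receiver


queues = {"X": [], "Y": [1, 2, 3], "Z": [4]}
first_move = migrate_once(queues)   # X is idle and Y holds three processes
second_move = migrate_once(queues)  # no machine is idle any more
```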


Stochastic  dynamics  of  a  computer  network,  and  Markov  allocation  policy 

The key decision of forming coarse-grained GMRES tasks can be supported by stochastic
forecasts based on a Markov chain dynamic model of the MIMD architecture, composed
of $M$ processor units connected by an asynchronous communication medium (e.g.
packet-oriented network technology). We divide the whole range of values of a physical
performance parameter of each machine (CPU utilization, RAM occupation,
load, etc.) into $K$ adjacent subranges and choose a sequence of time instances
$n = 1, 2, \ldots$ (e.g. the end of each hour), so that we may say that a particular machine is in the
state $j \in \{1, \ldots, K\}$ if its performance parameter is in the $j$-th subrange. We assume
periodic behavior (e.g. with a 24-hour period) of each performance parameter, so
it suffices to identify a finite set of transition matrices $\{P^m(n)\}$ of
dimension $K \times K$, such that the dynamics can be described by the Markov evolution
formula:

$$\Pi^m(n) = P^m(n-1)\, \Pi^m(n-1), \qquad (6)$$

where $\Pi^m(n)$ is the probability distribution vector (of dimension $K$) of the
physical parameter in the $n$-th time step for machine number $m$. If we know the
initial distribution $\Pi^m(\mu)$, we are able to forecast $\Pi^m(n)$ for $n > \mu$ by applying (6)
consecutively. $\Pi^m(\mu)$ may be set to $\{\delta_{ij}\}$ if we have observed that the $m$-th machine is
in the $j$-th state at the initial time step $\mu$. We assume that each task to be processed in
parallel has a computational complexity that is a multiple of the pattern task
complexity. This is true in the case of discrete FEM analysis, where single-element
operations quantify the pattern task size.
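
The repeated application of the Markov evolution formula (6) can be sketched in Python. The sketch assumes column-stochastic transition matrices (entry `P[i][j]` is the probability of moving from state `j` to state `i`), which matches a matrix acting on the distribution vector from the left; the two-state matrix and all names are hypothetical.

```python
def markov_step(P, pi):
    """One application of (6): returns the product of matrix P and vector pi."""
    K = len(pi)
    return [sum(P[i][j] * pi[j] for j in range(K)) for i in range(K)]


def forecast(matrices, pi_mu, mu, n):
    """Forecast the distribution at step n > mu from the distribution at step mu.

    matrices is the periodic list of transition matrices; the matrix used at
    step k is matrices[k % len(matrices)].
    """
    pi = list(pi_mu)
    for step in range(mu, n):
        pi = markov_step(matrices[step % len(matrices)], pi)
    return pi


# Hypothetical two-state machine ("lightly loaded", "heavily loaded").
P = [[0.9, 0.5],
     [0.1, 0.5]]
pi_start = [1.0, 0.0]            # machine observed in the first state at step 0
pi_two_steps = forecast([P], pi_start, mu=0, n=2)
```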

To pass to a more convenient description we introduce a new Markov chain
$T^m(n)$, which describes the mean execution time of the pattern task in the $n$-th time period on
the $m$-th machine. This chain is also based on a partitioning into $K$ adjacent subranges,
as in the case of the physical performance parameters, such that $T^m(n)$ can take $K$
discrete states with the same probability distribution $\Pi^m(n)$. We refer to the paper by
Onderka and Schaefer [11] for more details of the above model.

Using the identified chains $\{T^m(n)\}$, the useful power coefficients for the $n$-th
time instance, $n > \mu$, can be determined as:

(^„(^,/7),  (|)„(4,;7)  = 

A()I,/7)  V 

A(|i.«)  =  l",^„,(lt,«) 


(T-in)) 


(7)

where the expected value is taken assuming the initial probability
distribution $\Pi^m(\mu)$. We will use the open-loop Markov policy, which consists in:

• evaluating $\Pi^m(\mu)$ at the beginning of each time loop $t_i$, $i = 1, 2, \ldots$ (see stage
III in Fig. 2) and computing the power coefficients $\varphi_m(\mu, \mu+1)$ for $m = 1, \ldots, M$,

• decomposing $\Omega(t_i)$ into disjoint subdomains $\Omega^s(t_i)$, $s = 1, \ldots, S$, for $S = rM$
(computational mesh "layers") for some $r > 1$, such that the overall
computational complexity of the tasks (matrix computation tasks and GMRES
tasks) associated with subdomain numbers $(m-1)r, (m-1)r+1, \ldots, mr$ is proportional
to the power coefficient $\varphi_m(\mu, \mu+1)$ of the $m$-th machine,

• allocating tasks number $(m-1)r, (m-1)r+1, \ldots, mr$ to the $m$-th processor unit.


Fig. 3. Test result, case no. 1 (legend: No migration, Migration, Migration+allocation)


Test  results 

Various combinations of task management methods have been implemented and
tested for a simple press geometry, without local adaptation and computational mesh
adaptation. The possibility of flexible mesh partitioning and a hardware environment
enabling task migration mean that advanced forecast methods are not needed (since
an improper allocation can be easily corrected). However, the simulation system will
later be applied to presses of more complicated geometry and with h-p mesh adaptation,
which cause a significant growth of computational and memory complexity and
a worse granularity of elementary tasks. Sophisticated stochastic control is justified in
this case, as in the example presented in [12].

Below we present the results of three different simple task allocation strategies used
during the tests:

1. all machines get the same number of tasks (No migration on the charts),

2. all machines initially get the same number of tasks, but their location can change
during the execution according to the migration criterion described above (Migration),

3. similar to point 2, but the initial task distribution is determined on the basis of the
result from the previous time step (Migration+allocation).


Fig. 4. Test result, case no. 2 (legend: No migration, Migration, Migration+allocation)

A problem containing 48861 d.o.f. has been solved. Results for the first 10 time
steps of two test cases are presented in Figs. 3 and 4. In the first case none of the
machines was loaded (except, of course, for some system processes/daemons).
In the second case, machine #3 was additionally slightly loaded by two processes
(consuming about 30% of CPU) during iterations 5-7. In the first case, all results
were similar, with a small advantage for the migration-based strategies. In case
no. 2, the migration policies (especially Migration+allocation) gave much better results
than the simple round-robin strategy.


Table 1. Test case #2, strategy No migration

iteration  time [s]  number of tasks on machine
                     #1         #2         #3         #4
1          370       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
2          375       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
3          -         38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
4          379       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
5          498       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
6          480       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
7          498       38/-0/+0   37/-0/+0   -          -
8          376       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
9          377       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0
10         383       38/-0/+0   37/-0/+0   37/-0/+0   38/-0/+0

More detailed results for the second test case are presented in Tables 1-3. For
each machine the following are shown: initial number of tasks / number of tasks
migrated from the machine / number of tasks migrated to the machine.


Table 2. Test case #2, strategy Migration

iteration  time [s]  number of tasks on machine
                     #1         #2          #3          #4
1          363       38/-0/+4   37/-0/+12   37/-13/+0   38/-8/+5
2          369       38/-0/+4   37/-0/+12   37/-12/+0   38/-8/+4
3          359       38/-0/+4   37/-0/+5    37/-9/+0    38/-2/+2
4          368       38/-0/+4   37/-0/+12   37/-12/+0   38/-8/+4
5          430       38/-0/+5   37/-0/+10   37/-0/+3    38/-18/+0
6          416       38/-0/+6   37/-0/+8    37/-0/+3    38/-17/+0
7          425       38/-0/+6   37/-0/+9    37/-0/+3    38/-18/+0
8          369       38/-0/+5   37/-0/+11   37/-13/+0   38/-8/+5
9          362       38/-1/+4   37/-0/+12   37/-12/+0   38/-7/+4
10         370       38/-3/+3   37/-0/+11   37/-10/+0   38/-7/+6

Table 3. Test case #2, strategy Migration+allocation

iteration  time [s]  number of tasks on machine
                     #1          #2         #3          #4
1          366       38/-0/+5    37/-0/+10  37/-11/+0   38/-8/+4
2          361       43/-14/+0   47/-0/+7   26/-0/+8    34/-6/+5
3          356       29/-0/+12   54/-5/+4   34/-11/+0   33/-5/+5
4          364       41/-8/+0    53/-2/+0   23/-0/+10   33/-4/+4
5          -         33/-0/+7    51/-0/+5   33/-0/+2    33/-14/+0
6          396       40/-5/+4    56/-0/+5   35/-12/+0   19/-0/+8
7          390       39/-0/+3    61/-4/+0   27/-8/+2    -
8          378       42/-5/+4    57/-6/+1   30/-4/+0    21/-0/+10
9          357       41/-8/+0    52/-0/+2   26/-0/+1    31/-0/+5
10         353       33/-0/+6    54/-3/+0   -           36/-4/+1

Scalability  of  the  whole  computational  scheme 

Scalability of a parallel, distributed application with respect to the MIMD
architecture can be understood as a relation $S$ between the set of resources
(CPUs, RAM, etc.), a subset of $\mathbb{R}^k$, and the set of application requirements
(computation time, etc.), a subset of $\mathbb{R}^l$, where $k$, $l$ are natural numbers. We may
also study conditional scalability as a partial relation $S_c \subset S$ with some parameters
from both sets fixed.

We  may  consider  in  case  of  GMRES-SBS  application  k=l=2,  ^)), 

where  N,  R,  n,  T  are  the  number  of  processors,  RAM  per  node,  number 
of  degrees  of  freedom  and  the  time  of  parallel  computations  respectively.  Assuming 
again  that  the  number  of  GMRES  iterations  is  proportional  to  the  (see  [13]),  we 
have: 

3 

•  is  proportional  to  ,  by  7’=const., 

•  is  proportional  to  n-l^n  +1  assuming  the  constant  memory  per  node, 

•  -ijjby  the  constant  efficiency,  where  A,  B,  C 

are  positive  constants,  which  depend  on  the  transmission  speed,  processor 
arithmetic  and  the  amount  of  computation  for  the  single  element  in  the  single 
GMRES  iteration  and  for  the  local  stiffness  matrix  formulation  and  inversion. 

See  [14]  for  the  proof  of  the  above  proposition. 


Mesh  partitioning 

Proper  partitioning  of  a  mesh  has  a  great  influence  on  the  resulting  computation 
performance.  Every  slave  task  has  a  computational  complexity  of  0{n’^)  (where 
rt^.  is  the  number  of  internal  d.o.f.  in  )  and  the  size  of  the  global  interface  is 

(only) a linear function of the number of subdomains. Therefore, at first sight, the
more subdomains we have, the shorter the computation time. In practice, however,
too large a number of tasks requires a lot of network transmissions, which can
"consume" all the gains obtained from the lower theoretical complexity of the problem.

Our tests show that there exist a few points (possibly a single one) of "equilibrium"
between the increase in computation time caused by subdomains that are too large
and the performance degradation introduced by too frequent and intensive network
communication. At those points the total execution time (as a function of the
number of tasks) has a minimum. Moreover, they seem to occupy a narrow range, so
in practice they can easily be found experimentally. Naturally, they can be different
for different computational environments or different conditions (e.g., load, network
traffic). Nevertheless, we think it is enough to determine the "optimal" number of
tasks once, at the beginning of the computations.
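
The trade-off described above can be illustrated with a toy cost model. The sketch below is not the authors' performance model: the cubic cost per subdomain, the linear communication term and the constants alpha and beta are all assumptions chosen only to show how a minimum appears at a moderate number of tasks. The mesh size of 24461 d.o.f. and the four slave machines are taken from the text.

```python
def iteration_time(k, n=24461, machines=4, alpha=1e-9, beta=0.5, p=3):
    """Toy cost of one iteration with k equal subdomains.

    All constants (alpha, beta, p) are hypothetical, not measured values.
    """
    # k tasks, each costing (n/k)**p operations, shared by the machines
    compute = alpha * k * (n / k) ** p / machines
    # interface traffic assumed to grow linearly with the number of subdomains
    communicate = beta * k
    return compute + communicate


def best_task_count(candidates):
    """Return the candidate number of tasks with the smallest modelled time."""
    return min(candidates, key=iteration_time)


best = best_task_count(range(1, 201))
print(best, round(iteration_time(best), 2))
```

With these made-up constants the minimum falls at a few tens of subdomains and the curve is flat around it, which matches the qualitative behaviour reported by the tests.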

Below are the execution times of one SBS iteration for a mesh composed of
24461 d.o.f. Tasks were allocated equally to all available machines; no
migration was applied.

Fig. 1. Execution time for various mesh partitionings

Since, as the tests show, a mesh should usually be divided into at least several
tens of subdomains, which are distributed among only several machines, we always
partition a mesh into subdomains of (almost) equal size. This approach simplifies
task management (e.g., it is easier to predict execution times).


Computational  environment 

All tests presented in this paper were run on a network composed of:

•  4 Sun SparcStation machines (models 4, 5 and 20) with various amounts of RAM
(24-128 MB) running SunOS 4.1.4, for slave processes,

•  1 PC (Pentium 66) running Linux, for master and control processes.

MPVM 3.3.4 [15] was used as the communication and migration tool.

The target version of the system described in this paper is being developed
according to the Object Oriented (OO) paradigm, using a CORBA-based library
(Chorus [16]). This version has been run in a heterogeneous environment composed of
Sun SparcStation ELC, Sun 490 and SGI Origin 200 machines and exhibits effectiveness
quite similar to that of the PVM version. CORBA-based distributed environments allow
object mobility, which is equivalent to task migration in MPVM while offering much
more portability across hardware platforms and operating systems. This capability
will soon be included in our target version of the system.
Conclusions

i. The simulation of the forming process will give the ultimate answer about the
best combinations of manufacturing parameters for reducing the production cost.

ii. The increased speed of computations will make it possible to use process
simulation as a guideline for mixing, for adapting recipe parameters and for
optimizing the forming conditions in a production plant.

iii. Task migration can dramatically improve control when computer performance
changes rapidly during one time iteration step (see case #2, Fig. 4).

iv. Applying the stochastic policy is pointless for elementary tasks of fine
granularity, because migration can easily correct an improper initial
allocation. However, the stochastic policy is best suited to coarse-grained
problems with huge computational and memory complexity and a periodically
loaded network. Such problems appear mainly in complex-geometry CAE problems
solved by the h-p adaptive finite element method (see e.g. [12, 14]).

v. The final number of tasks (after migration) run on a machine during a
particular iteration can be used to determine the power coefficient for the next
step (see case #2, migration + allocation).

vi. Partial scalability tests (number of subdomains vs. CPU, RAM and network)
show that there exists an optimal partitioning rate for the current CAE parallel
problem and distributed computer environment (see Fig. 5).


References 


1.  Manzione L. T. ed.; Application of Computer Aided Engineering in Injection Molding,
Hanser Publ., 1987.

2.  Bom M et al.; Carbon, 30, 141, 1992.

3.  Danielewski M, Holly K, Bozek B, Bednarz S, Golec S and Filipek R; Dynamics of the
graphite electrode forming process, Univ. of Mining and Metallurgy, Cracow, 1998, Rep.
1246/98.

4.  Bozek B, Holly K, Jaskólski J; Variance methods for thermo load of elements of IC
engine, 20th International Congress on Combustion Engines (CIMAC 1993), London 1993, D74.

5.  Holly K, Mosurski R; An automatic triangulation of an arbitrary flat domain, Opuscula
Mathematica, Vol. 17, 1997, pp. 23-32.

6.  Thomée V; Galerkin Finite Element Methods for Parabolic Problems, Springer Verlag,
1984.

7.  Lions J.-L, Magenes E; Problèmes aux limites non homogènes et applications, vol. 1 et 2,
Paris, Dunod, 1968.

8.  Burden R. L, Faires J. D; Numerical Analysis, third edition, Boston, Prindle, Weber &
Schmidt, 1985.




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


9.  Papadrakakis M, Bitzarakis S; Domain decomposition PCG-methods for serial and
parallel processing, Computing Systems in Engineering, 1995.

10.  Geist A. and others; PVM 3 User's Guide and Reference Manual, Oak Ridge National
Laboratory, 1994.

11.  Onderka Z, Schaefer R; Markov chain based management of large scale distributed
computations of earthen dam leakages, Lecture Notes in Computer Science 1215, Springer
Verlag 1996, pp. 49-64.

12.  Mysliwiec G, Sipowicz J, Schaefer R; Control activities in Message Passing Environment,
Lecture Notes in Computer Science 1332, Springer Verlag 1997, pp. 143-150.

13.  Barragy E, Carey G.F, Van de Geijn R; Performance and Scalability of Finite Element
Analysis for Distributed Parallel Computation, Journal of Distributed and Parallel
Computing 21, 1994, pp. 202-212.

14.  Flasiński M, Schaefer R, Toporkiewicz W; Scalability in Concurrent Discrete Analysis of
Structures, Proc. of PCCMM'97, Vol. 1, pp. 379-386, Poznan 1997.

15.  Casas J., Clark D. L., Konuru R., Otto S.W., Prouty R.M., Walpole J; MPVM: A
Migration Transparent Version of PVM, Computing Systems, vol. 8, No. 2, pp. 171-216,
1995.

16.  The Common Object Request Broker Architecture and Specification; Revision 2.0, Object
Management Group, 1995.




VECPAR'98  ■  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


Experimental  Analysis  of  a  Parallel 
Quicksort-Based  Algorithm  for  Suffix  Array 
Generation 


Autran Macedo^1, Marco Antonio Cristo^1, Elaine Spinola Silva^1, Denilson
Moura Barbosa^1, Joao Paulo Kitajima^1, Berthier Ribeiro^1, Gonzalo Navarro^2,
and Nivio Ziviani^1

^1 Departamento de Ciência da Computação
Universidade Federal de Minas Gerais
Belo Horizonte, MG - BRAZIL
latin@dcc.ufmg.br

^2 Departamento de Ciencias de la Computacion
Universidad de Chile
Santiago - CHILE
gnavarro@dcc.uchile.cl


Abstract. This paper presents experiments performed with an implementation
of a quicksort-based parallel indexing algorithm. Besides the expected
reduction in execution time, it was observed that the word frequency
distribution of the input textual database has a strong influence on
performance. Communication and computational load balances are achieved by
processing the same quantity of text on each processor. This effectively
occurs due to the auto-similar feature of texts, verified experimentally in
this work. Also, as seen in the experiments, the auto-similarity of the word
frequency distribution implies that this distribution is independent of the
text size. In terms of implementation, a priori knowledge of this word
frequency distribution may improve the indexing time by eliminating certain
parts of the algorithm.


Keywords:  Parallel  Processing,  Information  Retrieval,  Index  Generation, 
Auto-Similarity,  Message  Passing. 


1  Introduction 

Information retrieval is a research area of growing interest to the scientific
community. One of the most relevant research fields in that area is string search
in textual databases. String search involves not only the database query,
but also the database indexing and the user interface [1]. In the case of
information retrieval in Internet homepages, the search process may also involve
the automatic scan of World Wide Web (WWW) sites and the download of these
homepages for further indexing.

Index generation time is critical. Its sequential complexity is at least
on the order of the database size, since all the words may be indexed. In this
sense, parallel strategies can be devised in order to reduce the index generation
time. The algorithm proposed in [2] is based on a quicksort approach, where
the textual database is partitioned among processors interconnected by a fast
network [3]. The index structure is based on a suffix array [1].

The goal of this paper is to present some experimental results of the first
version of the algorithm implementation. It was observed that the word frequency
distribution in textual databases plays an important role in the program
performance. The following Section describes the parallel index generation
algorithm. Next, experiments on index generation are presented. The influence of
text characteristics is then discussed, followed by some conclusions.


2  The  Parallel  Generation  of  Suffix  Arrays 

Searching a large full text for user-specified patterns is a time-consuming task
which requires special indexing schemes. A suffix array (or pat array) [1] is a
linear structure composed of pointers to every suffix in the text (since the user
is normally allowed to query on words, it is customary to index only word
beginnings). These index pointers are sorted according to a lexicographical
ordering of their respective suffixes, and each index pointer can be viewed simply
as the offset (counted from the beginning of the text) of its corresponding suffix
in the text. To find the user patterns, binary search is performed on the array.
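
As a rough illustration of the structure just described, the following Python sketch builds a suffix array over word beginnings and answers a prefix query by binary search. The function names and the regular expression used to locate word beginnings are assumptions made for this example; it is a small in-memory sketch, not the implementation evaluated in this paper.

```python
import re


def build_suffix_array(text):
    """Offsets of word beginnings, sorted by the suffix that starts at each offset."""
    starts = [m.start() for m in re.finditer(r"\b\w", text)]
    return sorted(starts, key=lambda i: text[i:])


def search(text, sa, pattern):
    """Offsets of all indexed suffixes that begin with `pattern`, in text order."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "the cat sat on the mat"
sa = build_suffix_array(text)
print(sa)                       # offsets in lexicographic order of their suffixes
print(search(text, sa, "the"))  # offsets where a word starts with "the"
```

Both binary searches compare only the first len(pattern) characters of each suffix, which is valid because truncating lexicographically sorted strings keeps them in non-decreasing order.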

The central idea of the parallel algorithm is as follows, considering a fast
network of independent computers [3]. Imagine the final result of the process:
the global sorted suffix array. If that array is cut into r similarly-sized portions
(which are called slices), what the algorithm does is to assign a slice to each
processor and make it sort that slice. Originally, each processor contains some
elements of each slice.

An α-percentile is the value at position αn in the global sorted suffix array.
For example, the 1/r-percentile is the element at position b, where b = n/r is the
slice size. Our algorithm partitions the data to be worked on by each processor by
finding the percentiles 1/r, 2/r, ..., (r-1)/r (r is the number of processors). An
alternative definition for slice is: the portion of the global suffix array between
two consecutive (i/r)-percentiles.
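
The relation between percentiles and slices can be made concrete with a few lines of Python. This is only an index calculation on an already sorted array of n entries; the function name and the use of integer division for rounding are assumptions of the example.

```python
def slice_bounds(n, r):
    """Percentile positions i*n/r (i = 1..r-1) and the r slices they delimit.

    Slices are half-open index ranges [start, end) into the sorted array.
    """
    cuts = [(i * n) // r for i in range(1, r)]
    edges = [0] + cuts + [n]
    slices = [(edges[i], edges[i + 1]) for i in range(r)]
    return cuts, slices


cuts, slices = slice_bounds(10, 4)
print(cuts)    # positions of the 1/4-, 2/4- and 3/4-percentiles
print(slices)  # one slice per processor
```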

The algorithm proceeds in five steps:

-  Step 1 - One master processor splits the text into pieces of the same size and
distributes them among slave processors;

-  Step 2 - Each processor builds internally its local suffix array and determines
its local percentiles;

-  Step 3 - The processors cooperate to find the r global percentiles. This
defines the part of each slice stored at each processor;

-  Step 4 - The processors engage in a distribution process so that every
processor gets the part of its slice stored on any other processor;

-  Step 5 - Each processor completes internally the sorting of its slice.
Consider a text T. T is split into pieces of the same size according to
the number of processors involved in the generation of the index, so that each
processor has a part of the text (step 1). Each processor generates its local
suffix array (step 2). In order to generate the global suffix array, the processors
find the percentiles of their local suffix arrays and broadcast them to the others
to determine the global percentiles (step 3). After that, each processor knows
which part of its suffix array belongs to itself and which parts belong to its
partners. The global all-to-all communication is then performed (step 4). Finally,
each processor sorts its local suffixes (step 5). The concatenation of the local
suffix arrays (of all processors) yields the global suffix array.
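
The five steps can be mimicked in a single process to see how the pieces fit together. The sketch below is a sequential simulation under loud assumptions: all "processors" share the whole text in memory, and step 3 is replaced by a simple sample-based choice of splitters (local percentiles are pooled and re-sampled) rather than the cooperative percentile search of [2]. Names such as simulate_parallel_index are hypothetical.

```python
import re
from bisect import bisect_right


def simulate_parallel_index(text, r):
    """Sequentially simulate the five steps for r 'processors'; returns the r slices."""
    key = lambda i: text[i:]
    # Step 1: split the text into r equal-size pieces (by offset range).
    piece = -(-len(text) // r)  # ceiling division
    starts = [m.start() for m in re.finditer(r"\b\w", text)]
    local = [[i for i in starts if p * piece <= i < (p + 1) * piece]
             for p in range(r)]
    # Step 2: each processor sorts its pointers and picks its local percentiles.
    local = [sorted(l, key=key) for l in local]
    samples = []
    for l in local:
        samples += [l[(k * len(l)) // r] for k in range(1, r) if l]
    # Step 3 (simplified, an assumption): pool the local percentiles and take
    # the percentiles of the pool as approximate global splitters.
    samples.sort(key=key)
    splitters = [key(samples[(k * len(samples)) // r])
                 for k in range(1, r)] if samples else []
    # Step 4: all-to-all exchange; every pointer goes to the slice it falls in.
    slices = [[] for _ in range(r)]
    for l in local:
        for i in l:
            slices[bisect_right(splitters, key(i))].append(i)
    # Step 5: each processor sorts the slice it received.
    return [sorted(s, key=key) for s in slices]


text = "to be or not to be that is the question " * 3
slices = simulate_parallel_index(text, 4)
print([len(s) for s in slices])  # slice sizes, i.e. the load of each processor
```

Because the splitters partition the key space in order, concatenating the slices gives the global suffix array regardless of how good the splitters are; their quality only affects how evenly the load is spread.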

Steps 4 and 5 dominate the execution time, due to the broadcast of suffix
arrays and the need for I/O operations. Figures 1 and 2 show this fact.
Figure 1 presents two turning points (phases 1 and 5 at 8 processors) due to
disk contention. One could ask why not parallelize only these steps.
The answer is the availability of primary memory. The aim of this study is the
index generation of very large files (on the order of gigabytes). A single computer
performing steps 1, 2 and 3 could not hold all the text in primary memory and
would therefore incur page faults.

In this way, the complexity of the algorithm (see [2] for details), in the average
case, is

$$O(b \log n)\,I + O(r^{2} \log b + b)\,C = O(b \log n)\,I + O(b)\,C$$

where n is the text size and b is the slice size. I is the computation unit cost and
C is the communication cost.

3  Experimental  Analysis 

The following Sections present the experimental environment and measures
concerning execution time and load balancing. Parallel quicksort is typically a
scalable strategy: a reduction of the execution time is expected. However, for
parallel indexing, the characteristics of the input textual database also strongly
influence the program performance.


3.1  Experiments 

Two message passing parallel machines are being used in the implementation
of the algorithm. One machine is at CENAPAD - MG/CO, a Brazilian supercomputer
center located at UFMG (Federal University of Minas Gerais); the other one is at
the LMC (Laboratory of Modeling and Calculus), located at IMAG (Mathematics
Institute of Grenoble) in France. The computer at CENAPAD - MG/CO is an IBM SP
with 41 nodes and 48 processors at 120 MHz, with primary memory ranging from
256 megabytes (MB) to 1 gigabyte (GB). The network is a switch at 155 MB/s and
all processors share a single disk system. The computer at LMC is an IBM SP with
32 processors at 66 MHz and 64 MB of main memory. Its network has a switch with
40 MB of unidirectional bandwidth, and there is a local disk for each processor.

Fig. 1. Percentage of time spent by each step of the program, when it was submitted
to a network of workstations in CENAPAD - MG/CO with a text input file of 100 MB.

The parallel program is written in ANSI C using MPI (Message Passing Interface)
[4] as the communication library. The benchmark textual database is composed
of text files of 100 MB and 200 MB, extracted from the Wall Street Journal part of
the TREC-3 collection [5].

The experimental results presented in this article were obtained by executing
the program at CENAPAD - MG/CO with 1, 4, 8, and 16 processors.
Some details should be stated:

-  only 54 MB of main memory were used by the processors of CENAPAD -
MG/CO. This limit was set because it is the maximum portion of main
memory used by the LMC processors when experiments are performed in
France. This memory size compatibility has to be kept because memory is
important in this study;

-  experiments with 2 processors were not performed due to the memory
utilization policy adopted by the program. Under this policy, a fixed share of
main memory is reserved for the pat array and the remainder for the text. Two
processors would incur too many disk I/O operations, which would resemble the
case of the sequential algorithm implementation.

Fig. 2. Percentage of time spent by each step of the program, when it was submitted
to a network of workstations in CENAPAD - MG/CO with a text input file of 200 MB.

Figures 3 and 4 present the performance of the program, considering text files
of 100 MB and 200 MB. The speedup was measured with respect to a sequential
implementation of the algorithm presented in [6].
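
Speedup in this sense is the sequential elapsed time divided by the parallel elapsed time, and efficiency is the speedup divided by the number of processors; an efficiency above 1 corresponds to super-linear speedup. The short sketch below shows the arithmetic. The timing values are hypothetical and are not measurements from these experiments.

```python
def speedup(t_sequential, t_parallel):
    """Ratio of sequential to parallel elapsed time."""
    return t_sequential / t_parallel


def efficiency(s, processors):
    """Speedup per processor; values above 1.0 indicate super-linear speedup."""
    return s / processors


# Hypothetical elapsed times in seconds (not taken from the paper).
t_seq, t_par, p = 1000.0, 200.0, 4
s = speedup(t_seq, t_par)
print(s, efficiency(s, p))
```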

Fig. 3. Measures of elapsed time for index generation using 1, 4, 8 and 16 processors,
connected by a fast network at CENAPAD - MG/CO, considering files of different sizes.

It can be noticed (Figure 4) that with 4 processors the speedup is super-linear
(5.12 and 5.76 with files of 100 MB and 200 MB, respectively) and that with 16
processors the speedup is poor. Super-linear speedup is observed because the
implementation of the sequential algorithm shows quadratic behaviour due to disk
I/O operations, although its complexity is n log n [6]. Sub-linear speedup occurs
because of the single disk shared by all processors at CENAPAD - MG/CO.
The more processors involved in the index generation, the greater the number
of files that each processor must deal with. These files are created by the
processors during the broadcast (step 4). Competition for the I/O bus and the seek
time of the disk determine the low performance of the program in this
environment. The domination of I/O time in the performance of the program can be
observed in Figures 5 and 6. The curves show the percentage of time spent in
step 4 of the algorithm. It can be seen that the percentage of time spent in I/O
operations increases as the number of processors increases. There are 3 curves:
communication, I/O operations, and others. This last curve reflects activities
like:

-  suffix  array  compression; 

-  package  of  data  to  be  transmitted; 

-  contention  in  the  suffix  array  transmission. 

Figures 5 and 6 also present a turning point on the curve "others" when the
number of processors is 8. This turning point is caused by the competition of
the processors for the disk. Besides, the graphs show the percentage of time
spent by the algorithm when communication, disk I/O operations, and other
factors are considered. Beyond 8 processors, the percentage of time of other
factors (curve "others") decreases, while the percentage of time of disk I/O
operations increases.

Fig. 4. Curves of speedup obtained in a network of workstations at CENAPAD -
MG/CO.


3.2  Influence  of  Text  Characteristics 

One concern with the algorithm presented here is the load balance during steps
4 and 5. More specifically, in step 4, a non-homogeneous communication
load occurs if the exchanged parts of slices are of different sizes. In step 5,
likewise, the resulting slices to be sorted locally would have different sizes on
different processors, which would imply an unbalanced computational load. This
happens when the word frequency distribution is not auto-similar.

A structure is said to be strictly auto-similar [7] if it can be recursively
decomposed into small pieces, each of which is a replica of the original structure.
It is important to say that these parts are obtained through a scale transformation
of the original structure. Structures that can be decomposed into similar parts
down to a given scale are said to be auto-similar.

In order to detect auto-similarity, experiments were done over the following
collections [5]: AP (Associated Press) (1988), WSJ (1987), and Ziff-Davis
(complete). The collection files were split into pieces of 1,000; 10,000; 100,000;
300,000; 550,000; 700,000; 850,000; and 1,000,000 words. These different sizes
were used to detect whether the number of words in files of different sizes grows
linearly.

Fig. 5. Percentage of time spent in step 4 of the algorithm (which includes
communication, I/O operations on disk, compression of suffix arrays, packaging of
data to be transmitted, contention in suffix array transmission), considering a file
of 100 MB.

Figures 7 and 8 show the number of words starting with a given letter versus
the letters of the alphabet, independently of the text file size. Since these graphs
have the same shape, they are similar from the geometric point of view. This
similarity occurred in all the collections tested.

Table 1 presents linear regression information concerning each letter of the
alphabet for the AP collection. The numbers in this table confirm that the number
of words starting with a given letter ℓ grows linearly with the text file size. The
coefficient of determination of the linear regression is close to 1 (see third
column). Similar results were obtained for the Wall Street Journal and Ziff-Davis
collections.
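
For readers who want to reproduce this kind of check, the sketch below computes the least-squares slope and the coefficient of determination for word counts against file size. The file sizes are the piece sizes listed above; the word counts are made-up values (exactly 12% of each size), not data from the AP collection, and the function name is an assumption.

```python
def linear_fit(xs, ys):
    """Least-squares slope, intercept and coefficient of determination R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


sizes = [1_000, 10_000, 100_000, 300_000, 550_000, 700_000, 850_000, 1_000_000]
# Hypothetical counts of words starting with one letter: exactly 12% of each size.
counts = [120, 1_200, 12_000, 36_000, 66_000, 84_000, 102_000, 120_000]
slope, intercept, r2 = linear_fit(sizes, counts)
print(slope, intercept, r2)
```

A slope that stays the same across file sizes, together with R^2 close to 1, is what the text interprets as the share of words starting with that letter being independent of the file size.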

The second column of Table 1 (times 100) can also be considered as the
percentage of words starting with a given letter ℓ in a text file of size t. Ignoring
rounding errors, the total sum of the numbers of the second column of this table is

Fig. 6. Percentage of time spent in step 4 of the algorithm (which includes
communication, I/O operations on disk, compression of suffix arrays, packaging of
data to be transmitted, contention in suffix array transmission), considering a file
of 200 MB.


From a probabilistic point of view, Table 2 presents the distribution function
of the word frequency. Due to the auto-similarity, this function is independent
of the text file size. For example, the probability of choosing a word starting with
one of the letters A, B, ..., K is around 50%, independent of the file size.
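
The distribution function is simply the running sum of the per-letter slopes. The sketch below accumulates the Table 1 slopes for the letters A to D; the variable names are assumptions, and only these four values are used so that nothing beyond the legible table entries is needed.

```python
from itertools import accumulate

# Slopes for A, B, C and D as listed in Table 1.
slopes = {"A": 0.123193195, "B": 0.05259202, "C": 0.055195342, "D": 0.046114325}

letters = list(slopes)
cumulative = list(accumulate(slopes[l] for l in letters))
for last, value in zip(letters, cumulative):
    label = "A" if last == "A" else f"A-{last}"
    print(label, round(value, 9))
```

Up to rounding in the last digit, the running sums agree with the A, A-B, A-C and A-D rows of Table 2.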


Experimentally, the auto-similar feature of the text means that the computed
global percentiles effectively generate a homogeneous distribution of suffix
pointers among the processors (step 4: communication load balance). The number of
bytes sent and received by each processor is almost the same for all processors.
Consequently, the final local sort (step 5) will work with roughly the same
number of pointers, implying a computational load balance. The same conclusion
was obtained by simulation of the algorithm [2].

Fig. 7. Distribution of the word frequency (considering just the first letter) in the
AP collection, with 100,000 words.


4  Conclusions 

This paper presented experiments performed with an implementation of a
quicksort-based parallel indexing algorithm. Besides the expected reduction in
execution time, it was observed that the word frequency distribution of the input
textual database has a significant influence on performance.

The experiments were performed at CENAPAD - MG/CO, a supercomputing
center whose machine (an IBM SP) shares a single disk among all processors. This
computation environment led to a performance decrease of the program when
the number of processors involved in the index generation reaches 16.
This distortion occurred due to the strong competition among processors for the
I/O bus and to the seek time of the disk.

As future work, experiments will be performed with large files and launched
at LMC (France). At LMC there is a disk for each processor, and this should
decrease the competition for the I/O bus and make it possible to obtain better
speedup curves.
ℓ    Slope          Coeff. of Determ. (R^2)
A    0.123193195    0.997446196
B    0.05259202     0.997838698
C    0.055195342    0.998992098
D    0.046114325    0.998172851
E    [illegible]    [illegible]
F    0.05154366     0.99805157
G    0.017667218    [illegible]
H    0.047498466    0.996058323
I    0.061164202    0.997900196
J    0.008406008    0.99654862
K    0.005621952    0.992414744
L    0.024723614    0.998464284
M    0.0386796      0.997983352
N    0.026271595    0.998471551
O    0.059174769    0.997932797
P    0.047215617    0.99792522
Q    0.001560132    0.99621972
R    0.033622413    0.997980403
S    0.093770112    0.997193251
T    0.14873902     0.998472823
U    0.014391688    0.999281988
V    0.006702519    0.998155787
W    0.0573626      0.997223945
X    0.00012865     0.968463675
Y    0.008135312    0.997146319
Z    0.000446293    0.979992358

Table 1. Slope and coefficient of determination obtained from linear regression for
words starting with ℓ (regression of the text file size versus the frequency of
words starting with a given letter ℓ).
Interval   Percentage of Text
A          0.123193195
A-B        0.175785215
A-C        0.230980556
A-D        0.277094881
A-E        0.298701995
A-F        0.350245655
A-G        0.367912872
A-H        0.415411338
A-I        0.47657554
A-J        [illegible]
A-K        [illegible]
A-L        0.515327114
A-M        0.554006714
A-N        [illegible]
A-O        0.639453078
A-P        0.686668695
A-Q        0.688228827
A-R        0.72185124
A-S        0.815621353
A-T        [illegible]
A-U        0.978752061
A-V        0.985454579
A-W        1.042817179
A-X        1.042945829
A-Z        . . .

Table 2. Distribution function of the word frequency per letter ℓ.
Abstract  In this paper, a distributed computational system for finite element
structural analysis and some strategies for improving its efficiency are
described. The system consists of a set of programs that performs the structural
analysis in a distributed computer network environment. This set is composed
of a pre-processor, a post-processor, a program responsible for partitioning the
model into substructures, and a parallel structural analysis solver. The domain
partitioning is performed interactively by the user through a graphics interface
program (PARTDOM). An existing FEM code, based on object oriented
programming concepts (FEMOOP), has been adapted to implement the parallel
features. Different implementation aspects concerning scalability and
performance speed-up are discussed. The computational environment consists
of a 100 Mbit Fast-Ethernet network cluster including eight Pentium 200 MHz
micro-computers running under the LINUX operating system.


1  Introduction 

In this work, a distributed computational system for finite element structural
analysis will be presented. This system has been developed to perform the parallel
analysis using an available local area network as the processing environment. This
arrangement makes possible the use of low cost parallel capabilities, avoiding the
utilization of expensive supercomputers. A set of integrated components, each one
responsible for a specific task in the analysis process, comprises the computational
system. The four basic units of this system are presented in Fig. 1.

Fig. 1. Basic units of the parallel analysis system

This  set  has  four  components:  a  pre-processor,  a  post-processor,  a  program 
responsible  for  partitioning  the  model  in  substructures,  and  a  structural  analysis 
parallel  solver.  The  main  focus  here  is  the  domain  partitioning  and  the  solver,  in 
which  the  parallel  features  have  been  implemented.  The  domain  partitioning  is 
performed  interactively  by  the  user  throu^  a  graphic  interface  program,  called 
PARTDOM.  Three  automatic  substnicturing  procedures  are  available  and  the  user  is 
able  to  manually  edit  the  resulting  subdomain  partitions.  An  existing  FEM  code, 
based  on  object  oriented  programming  concepts  (FEMOOP),  has  been  adapted  to 
implement  the  parallel  features.  The  main  extensions  to  the  code  are  the  introduction 
of  a  preconditioned  conjugate-gradient  parallel  algorithm  to  solve  the  global 
equations and the communication procedures among the processors.

In  the  following  sections,  the  parallel  analysis  system  and  its  components  will  be 
described  in  detail.  Also,  different  aspects  affecting  the  system  performance  will  be 
discussed. The performance is measured through the time consumed in each step of
the analysis processing for different numbers of processors. Some strategies for
improving  the  system  efficiency  have  been  implemented  and  their  results  will  be 
discussed. 


2  The  Parallel  Analysis  System 

All  components  of  this  parallel  system  are  presented  in  this  section.  Special  attention 
is  placed  on  the  domain  partitioning  and  the  analysis  programs.  The  main 
development  and  implementations  are  located  in  these  programs. 

2.1 Pre-Processing

The package MTOOL (Bidimensional Mesh Tool) [1] has been used as the pre-processor
in this work. MTOOL is an interactive graphics program for bidimensional
finite  element  mesh  generation.  With  this  program,  the  geometry,  material  properties, 
and  boundary  conditions  of  the  model  are  defined.  Through  its  graphical  interface 
(Fig.  2),  the  program  makes  the  visualization  and  the  editing  of  the  generated  mesh 
possible. 


Fig.  2.  MTOOL  graphical  interface 


At  the  end  of  model  creation,  a  neutral  format  file  is  generated.  This  file  contains, 
in a standard way, all the information about the model that is necessary in the
analysis process. The neutral file is an essential feature for the integration of all
components  of  the  parallel  system. 


2.2  Partitioning 

After  the  model  generation,  the  structure  has  to  be  partitioned  into  a  number  of 
substructures  to  take  advantage  of  the  parallel  environment.  This  partitioning  is  made 
through  the  use  of  automatic  domain  partitioning  algorithms.  The  main  objectives  of 
these  algorithms  are:  (1)  to  obtain  a  balanced  work  distribution  among  the  processors, 
and  (2)  to  minimize  the  boundary  degrees  of  freedom  (DOF)  present  between  the 
substructures. These two objectives represent important aspects that affect the overall
system  performance. 
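
The two objectives can be made concrete with a small measurement routine. The following is only an illustrative sketch, not part of PARTDOM: the function name `partition_quality`, the element list and the partition vector are hypothetical, and boundary nodes are counted instead of boundary DOF (each node would contribute a fixed number of DOF).

```python
from collections import Counter, defaultdict

def partition_quality(elements, part_of):
    """elements: list of node-id tuples, one per element.
    part_of: substructure index assigned to each element (hypothetical input)."""
    # Objective (1): number of elements assigned to each substructure.
    sizes = Counter(part_of)
    # For every node, collect the set of substructures that touch it.
    parts_at_node = defaultdict(set)
    for nodes, part in zip(elements, part_of):
        for node in nodes:
            parts_at_node[node].add(part)
    # Objective (2): nodes shared by two or more substructures.
    boundary_nodes = sorted(n for n, parts in parts_at_node.items() if len(parts) > 1)
    # Ratio of the largest substructure to the average size (1.0 is perfect balance).
    imbalance = max(sizes.values()) / (len(elements) / len(sizes))
    return dict(sizes), boundary_nodes, imbalance

# Four triangles forming a strip, split into two substructures.
elements = [(0, 1, 2), (1, 3, 2), (2, 3, 4), (3, 5, 4)]
sizes, boundary, imbalance = partition_quality(elements, [0, 0, 1, 1])
print(sizes, boundary, imbalance)
```

A partition with a lower imbalance ratio and fewer boundary nodes would be preferred under the two objectives stated above.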

The basic aim of the PARTDOM program [2] is to facilitate the user interaction with
the automatic domain partitioning algorithms. The program allows the partition of a
bidimensional finite element mesh using three different algorithms. Through its
graphical interface (Fig. 3), the program permits the visualization and manipulation of
the  resulting  subdomain  partitions.  Thus,  PARTDOM  has  been  developed  with  the 
following  objectives: 

□  to  promote  user  interaction  with  the  automatic  domain  partitioning  algorithms, 

□  to  facilitate  partitioning  of  complex  meshes,  through  a  graphical  interface  that 
allows  visualization  of  resulting  partitions, 

□ to allow interactive user editing of resulting partitions,

□  to  be  portable,  i.e.,  the  program  code  is  platform  independent. 


Fig.  3.  PARTDOM  graphical  interface 

The three partitioning algorithms used in PARTDOM are: the Al-Nasra & Nguyen
algorithm [3] and two algorithms from the METIS library [4] called pmetis and
kmetis. Fig. 4 presents the partitions resulting from applying the three algorithms to a
finite element mesh composed of 587 T6 elements.

Fig. 4. Mesh partitions resulting from the use of algorithms: (a) Al-Nasra & Nguyen, (b)
pmetis, and (c) kmetis


The  three  algorithms  present  a  good  balanced  work  distribution  among  the 
processors,  with  approximately  the  same  number  of  elements  for  each  substructure. 
However, the Al-Nasra & Nguyen algorithm generally presents a number of boundary
DOF between the substructures greater than the pmetis and kmetis algorithms. This
characteristic increases the requirement of communication between the processors
and, consequently, decreases the parallel analysis performance. The partitions
resulting from the use of pmetis and kmetis algorithms present approximately the
same number of boundary DOF.

At  the  end  of  this  process,  the  partitioning  information  is  added  to  the  neutral  file. 


2.3  Analysis 

After the model generation and partitioning, the next step consists of the parallel
analysis of this model, with the use of a set of processors, each one responsible for
a part of the computational work. In this work, the Finite Element Method has been
employed in the analysis. An analysis program named FEMOOP (Finite Element
Method - Object-oriented Programming) [5], developed at the Department of Civil
Engineering (PUC-Rio) and at the Computational Mechanics Laboratory (Polytechnic
School / USP), has been used as the platform for the new parallel capabilities.

The program FEMOOP is organized using object-oriented concepts of the C++
programming language [6][7]. One of the most important advantages of
object-oriented programming is code extensibility. This feature allows new
implementations with minimum impact over the existing code. To adapt FEMOOP to
the parallel computational environment a new class has been created, which is
responsible for data manipulation. Also, a series of new functions has been
implemented into existing classes.

The message-passing manager used in this work has been PVM (Parallel Virtual
Machine) [8]. PVM is a software system that permits a heterogeneous computer
network to be used as a single parallel computer. The first step necessary to adapt
FEMOOP to the parallel environment has been the implementation of a library
responsible for the message passing management. The main objective of this library is
to limit the direct access to PVM functions. An eventual change of the
message-passing manager is thereby facilitated, since it has impact only over the library code. This
parallel procedure library contains all functions necessary to perform the message
passing in a distributed memory environment. The main functions implemented here
are responsible for sending and receiving messages among processors, for parallel
process initialization, and for the identification of program type (if it is either a master
or a task program).

The  parallel  programming  paradigm  adopted  here  has  been  the  master-slave  model. 
In  this  model,  the  master  is  a  separate  program  responsible  for  process  spawning, 
initialization,  reception  and  display  of  results,  and  timing  of  functions.  The  task  (or 
slave)  programs  are  executed  concurrently  and  interact  through  message  passing.  The 
actual structural analysis is done by the task programs, each one responsible for the
work  corresponding  to  one  substructure.  Through  interaction  between  these  task 
programs,  the  global  solution  is  obtained  and  then  it  is  sent  to  the  master  program. 

To  obtain  the  linear  system  of  equilibrium  equations  a  substructuring  technique[9] 
has  been  employed.  When  this  technique  is  used,  the  original  structure  is  partitioned 
into  a  number  of  substructures,  which  are  distributed  among  the  processors.  The 
substructure  degrees  of  freedom  are  classified  as  internal  DOF  and  boundary  DOF, 
which  are  shared  between  neighbor  substructures.  The  substructures  interact  through 
these boundary DOF only. Then, the stiffness matrices of each substructure are
mounted. In this work, the internal unknowns are eliminated using the Crout method.
After this step, a condensed system with terms corresponding only to boundary
unknowns is obtained. All these procedures can be performed concurrently. To solve
the  partitioned  global  system,  a  parallel  iterative  solver  has  been  used.  A  parallel 
implementation  of  the  pre-conditioned  conjugate  gradient  (PCG)  method[10][9]  has 
been  chosen  as  the  solution  method  adopted  in  this  work.  Basically,  this  parallel 
implementation  of  the  PCG  method  consists  of  parallel  operations  between  matrices 
and  vectors.  The  sequence  of  operations  is  the  same  both  in  the  parallel  and  in  the 
sequential  versions  of  the  PCG  method. 
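
The sequence of operations of a preconditioned conjugate gradient iteration can be sketched as follows. This is a minimal sequential illustration in pure Python, not the FEMOOP implementation: the function name `pcg` is hypothetical, and a Jacobi (diagonal) preconditioner is assumed here because the text does not state which preconditioner is used. The matrix-vector products and dot products are the operations that a parallel version would distribute among the processors.

```python
def pcg(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b for a symmetric positive definite dense matrix A
    (list of rows), using an assumed Jacobi (diagonal) preconditioner."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    z = [r[i] / A[i][i] for i in range(n)]       # preconditioned residual
    p = list(z)                                  # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:               # converged
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                  # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
print(x)
```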

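To employ this substructuring technique, the stiffness matrix $K^{(i)}$, the force vector
$f^{(i)}$, and the nodal unknowns $u^{(i)}$, corresponding to each substructure $i$, have been
mounted with terms partitioned into internal and boundary terms. These partitions are
presented in (1), where indices I and S correspond respectively to internal and
boundary (or shared) terms.

$$K^{(i)} = \begin{bmatrix} K_I^{(i)} & K_{IS}^{(i)} \\ K_{IS}^{(i)T} & K_S^{(i)} \end{bmatrix}, \quad
f^{(i)} = \begin{Bmatrix} f_I^{(i)} \\ f_S^{(i)} \end{Bmatrix}, \quad
u^{(i)} = \begin{Bmatrix} u_I^{(i)} \\ u_S^{(i)} \end{Bmatrix}. \tag{1}$$

For each substructure, the linear equation system is written in the form

$$\begin{bmatrix} K_I^{(i)} & K_{IS}^{(i)} \\ K_{IS}^{(i)T} & K_S^{(i)} \end{bmatrix}
\begin{Bmatrix} u_I^{(i)} \\ u_S^{(i)} \end{Bmatrix} =
\begin{Bmatrix} f_I^{(i)} \\ f_S^{(i)} \end{Bmatrix}. \tag{2}$$

Eliminating the internal unknowns, a condensed equation system is obtained

$$\bar{K}_S^{(i)} u_S^{(i)} = \bar{f}_S^{(i)}, \tag{3}$$

where $\bar{K}_S^{(i)}$ and $\bar{f}_S^{(i)}$ are respectively the condensed stiffness matrix and the
condensed force vector, and

$$\bar{K}_S^{(i)} = K_S^{(i)} - K_{IS}^{(i)T} \left(K_I^{(i)}\right)^{-1} K_{IS}^{(i)}, \tag{4}$$

$$\bar{f}_S^{(i)} = f_S^{(i)} - K_{IS}^{(i)T} \left(K_I^{(i)}\right)^{-1} f_I^{(i)}. \tag{5}$$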

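The global linear equation system can be written in the form

$$\left( \sum_{i=1}^{p} L^{(i)T} \bar{K}_S^{(i)} L^{(i)} \right) u_S = \sum_{i=1}^{p} L^{(i)T} \bar{f}_S^{(i)}, \tag{6}$$

where $p$ is the number of substructures and $L^{(i)}$ is a boolean matrix that describes the
substructure connectivity in the original structure.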


2.4  Post-Processing 

A program named MVIEW (Bidimensional Mesh View) [11] has been utilized as the
post-processor tool. MVIEW is an interactive graphical program (Fig. 5) for
visualization  of  structural  analysis  results.  This  program  provides  visualization  of  the 
deformed  configuration  and  contours  of  scalar  results  at  nodes  and  Gauss  points. 


Fig.  5.  MVIEW  graphical  interface 

3 Aspects of Parallel Processing


In  this  section,  different  aspects  that  affect  the  parallel  analysis  system  performance 
are  presented. 

The quality of partition obtained from application of automatic domain partitioning
algorithms affects the global performance of the parallel system. A good partition
generates substructures with approximately the same number of elements, and with
minimum number of boundary nodes between neighbor substructures. The
even distribution of elements is important to balance the computational work
load among the processors, avoiding occurrence of idle periods. Idle periods occur
when a processor interrupts its job to wait for information from a busy processor.
These waiting intervals cause system performance degradation. The number of
boundary nodes between substructures is an important aspect that influences the
parallel system solver performance. When the number of boundary nodes increases,
the number and size of messages exchanged during the solution phase also increase.
The three automatic domain partitioning algorithms used in this work present a good
work load distribution, but the Al-Nasra & Nguyen algorithm usually produces
boundaries with a greater number of nodes than the pmetis and kmetis algorithms [12].

The  message  passing  also  influences  the  initialization  process  executed  by  the 
master  program.  In  this  analysis  phase,  the  master  program  spawns  task  programs  and 
sends  all  information  needed  by  tasks  to  perform  the  analysis.  This  information 
includes geometry data, boundary conditions, material properties, etc. When the
model  size  increases,  the  message  sizes  also  increase,  reducing  consequently  the 
system  performance. 

The way the condensed stiffness matrix (equation (4)) is mounted is the most
important aspect that affects the parallel analysis system performance. First, system
equation (7) is solved

$$K_I^{(i)} x^{(i)} = k_{IS}^{(i)}, \tag{7}$$

where $k_{IS}^{(i)}$ is a column of $K_{IS}^{(i)}$, and the operation (8) is carried out to obtain a line
$\bar{k}_S^{(i)}$ of $\bar{K}_S^{(i)}$

$$\bar{k}_S^{(i)} = k_S^{(i)} - x^{(i)T} K_{IS}^{(i)}, \tag{8}$$

where $k_S^{(i)}$ is the corresponding line of $K_S^{(i)}$.

Therefore, solving equations (7) and (8) for each column $k_{IS}^{(i)}$, all lines of the
condensed stiffness matrix are obtained. This process is highly influenced by the
number of internal DOF of the substructure and by the bandwidth of the stiffness
matrix $K_I^{(i)}$, since this matrix must be decomposed to solve equation (7).
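
The column-by-column construction of the condensed stiffness matrix can be sketched as below. This is a small dense illustration only: the names `cholesky`, `solve_chol` and `condense` are hypothetical, and a Cholesky factorization is swapped in for the Crout method mentioned in the text, under the assumption that the internal stiffness matrix is symmetric positive definite. The point it illustrates is that the internal matrix is decomposed once and then reused for one solve per column of the coupling block.

```python
def cholesky(K):
    """Lower-triangular factor L with K = L L^T (K assumed SPD)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s ** 0.5 if i == j else s / L[j][j]
    return L

def solve_chol(L, rhs):
    """Solve (L L^T) x = rhs by forward and back substitution."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def condense(K_I, K_IS, K_S):
    """Return K_S - K_IS^T K_I^{-1} K_IS, built one column at a time."""
    n_i, n_s = len(K_I), len(K_S)
    L = cholesky(K_I)                                # decomposition done once
    Kbar = [[0.0] * n_s for _ in range(n_s)]
    for j in range(n_s):
        col = [K_IS[r][j] for r in range(n_i)]       # column j of K_IS
        x = solve_chol(L, col)                       # solve K_I x = column
        for r in range(n_s):
            Kbar[r][j] = K_S[r][j] - sum(K_IS[k][r] * x[k] for k in range(n_i))
    return Kbar

K_I = [[4.0, 1.0], [1.0, 3.0]]
K_IS = [[1.0, 0.0], [0.0, 1.0]]
K_S = [[2.0, 0.0], [0.0, 2.0]]
Kbar = condense(K_I, K_IS, K_S)
print(Kbar)
```

The cost of each solve grows with the number of internal DOF and with the bandwidth of the internal matrix, which is why nodal reordering matters for this step.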


4  System  Performance 

The  parallel  system  performance  is  presented  in  this  section.  Two  versions  of  this 
system  have  been  evaluated  and  compared  for  the  same  model.  The  second  version 
incorporates  implementations  to  improve  the  system  performance. 



VECPAR'98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


The hardware setup consists of a 100 Mbit Fast-Ethernet network cluster including
eight Pentium 200 MHz micro-computers running under the LINUX operating system.
This  cluster  is  fully  dedicated  to  parallel  processing. 

The  numerical  example  used  to  measure  system  performance  is  a  beam  with 
geometry,  boundary  conditions  and  mesh  presented  in  Fig.  6.  The  model  attributes 
are:  E  =  7000  kN/cm^,  v  =  0.25  and  thickness  =  1  cm. 


10  kN 


The  model  is  created  and  discretized  using  a  regular  mesh  of  13x130  plane  stress 
Q8  elements.  The  mesh  is  partitioned  into  2  to  8  substructures,  using  the  kmetis 
algorithm  available  in  PARTDOM.  Fig.  7  presents  the  resulting  mesh  partitioning 
when  4  substructures  are  used. 


Fig. 7. Mesh partitioning into 4 substructures

Through  the  analysis  of  performance  of  the  first  version  of  the  parallel  system,  the 
main  bottleneck  aspects  have  been  identified.  These  aspects  have  been  presented  in 
the  last  section.  The  second  version  of  the  parallel  system  incorporates  new 
implementations  to  improve  the  performance  of  critical  system  routines. 

In  the  first  FEMOOP  parallel  version,  the  master  program  is  responsible  for 
initialization,  spawning  of  task  programs,  and  sending  all  information  needed  by  tasks 
to perform the actual analysis. This information includes all geometry data, topological
data, material properties, boundary conditions, and domain partitioning data. With a
large scale model, the time for this step becomes significant. In the second version, the task
programs read information directly from data files, avoiding large message
exchanges.

Another aspect influencing system performance is the decomposition of the $K_I^{(i)}$
matrix, required to mount the condensed stiffness matrix. Usually this step causes an
unbalanced computational work distribution among processors because the
substructures, generated through automatic domain partitioning algorithms, present a
stiffness matrix that is not optimized for numerical analysis. Thus, in the second
version,  FEMOOP  is  able  to  perform  nodal  reordering  for  each  substructure,  reducing 
the  time  to  mount  the  condensed  stiffness  matrix. 

In  the  new  FEMOOP  version,  the  pointers  to  nodal  objects  are  obtained  through  a 
vector  request,  and  not  through  a  linked  list  of  pointers.  The  use  of  a  vector  of 
pointers  increases  the  overall  system  performance. 

The main parallel analysis phases are presented in Fig. 8, 9, 10 and 11. These
graphs compare the time consumed by the first and second FEMOOP
versions. Fig. 8 presents the time consumed to send substructure information to each
processor. Fig. 9 presents the time consumed to mount the condensed stiffness matrix.
Fig. 10 presents the time needed to solve the linear system of equations using the parallel
solver. Fig. 11 depicts the time consumed to perform the complete analysis.


Fig. 8. Time consumed to send substructure data to processors (chart "Send Data", plotted against the number of processors)

Fig. 9. Time consumed to mount condensed stiffness matrix (chart "Condensed Stiffness Matrix", version 1 and version 2 plotted against the number of processors)



Fig.  10.  Time  consumed  to  solve  linear  equation  system  using  the  parallel  solver 


Fig. 11. Total time consumed to complete the analysis


The  effect  of  direct  access  to  data  files  is  evident  (Fig.  8).  The  time  consumed  to 
perform the data transmission is much lower in the second FEMOOP version. The
disadvantage  of  direct  access  is  that  an  up-to-date  data  file  must  be  available  to  each 
system  processor. 

The nodal reordering has a crucial effect on the condensed stiffness matrix
mounting routine. Fig. 9 shows a great time reduction in the second FEMOOP
version, especially for a small number of substructures. To mount the condensed
stiffness matrix, decomposition of matrix $K_I^{(i)}$ is needed. The time consumed in this
decomposition also decreases greatly with nodal reordering.

The parallel solver is highly affected by synchronization problems (Fig. 10). The
first step of the parallel solver is the mounting of the pre-conditioning matrix. To
accomplish this, message passing is needed. Thus, the time consumed by the solver
incorporates idle periods of processors waiting for information from another busy
processor.  This  is  a  problem  that  appears  in  both  FEMOOP  versions.  Nodal 
reordering  has  not  been  sufficient  to  address  this  problem. 

The  new  implementations  allowed  a  reduction  in  the  total  analysis  time  (Fig.  11) 
when  2  to  8  processors  are  used  simultaneously. 


5  Conclusions 

A  parallel  structural  analysis  system  has  been  presented  in  this  work.  This  parallel 
system permits a better utilization of the resources present in a local area network.
This  arrangement  enables  a  low  cost  parallel  structural  analysis  environment, 
avoiding  for  certain  classes  of  problems  the  utilization  of  expensive  supercomputers. 

A  first  version  of  the  parallel  system  has  been  used  to  identify  the  main  aspects 
affecting  the  global  system  performance.  New  implementations  have  been  added  to 
the  code  to  improve  performance.  The  development  of  this  system  takes  advantage  of 
the  object-oriented  programming  concepts,  which  permits  new  implementations 
without  great  impact  over  the  existing  code. 

The nodal reordering and the direct access to data files greatly improved the
system performance. The reduction of the time consumed to mount the condensed
stiffness matrix and to send data to all processors confirms the adequacy of these
new implementations.

Some  improvements  are  still  needed  to  resolve  the  synchronization  problem  that 
appears  in  the  current  system.  One  of  these  improvements  is  certainly  a  better  work 
distribution  among  processors. 

This  system  will  be  used  in  a  future  work  to  perform  3D  fracture  analysis  of 
cracked  structures.  To  accomplish  this  objective,  new  features  and  improvements  will 
be  necessary.  Automatic  3D  domain  partitioning  and  dynamic  computational  work 
distribution  are  some  of  these  improvements. 


Abstract


In order to minimize the heavy cost of running parallel applications
on expensive MPPs, we consider the use of workstation clusters.
The main difficulty in obtaining efficiency on such architectures is the
fact that a slow node may dramatically decrease the overall performance
of the machine. We propose a dynamic load balancing solution,
based on a local partitioning scheme, consisting of moving sub-domain
boundaries so as to adjust the processor load according to its available
CPU capability.

1  Introduction 

Traditionally, heavy scientific applications are run on large and expensive mainframes
such as massively parallel computers (MPPs), which are efficient but
rather expensive. Price is not the only problem: in order to regulate the available
CPU resources, system managers configure the computer to accept large
jobs only in batch mode. Then, even if the mainframe is very powerful, the user
may wait for a long time before the job completes.

On the other hand, commodity computers like workstations or personal
computers become more and more powerful and cheaper. With an appropriate
message  passing  library  such  as  PVM,  they  represent  an  interesting  alternative 
to  MPPs  in  order  to  run  heavy  parallel  applications  in  a  fully  interactive  way 
and  at  better  cost. 

However,  using  such  a  solution  is  not  as  simple  as  porting  the  code  from 
the  parallel  mainframe  to  the  workstation  cluster.  A  crucial  problem  of  load 
balancing  may  appear  because,  usually,  the  processing  nodes  enrolled  in  a  cluster 
may not all have the same amount of CPU resources. This may be due to a difference
in the chip power, or because they are not running in a single-user mode. Another
user may need the resources, making the CPU balance very uneven.

Such unbalanced situations are crucial problems because, in most cases,
processing nodes must synchronize from time to time in a parallel application.
Since the global process time is set by the slowest node, one slow node is
enough to reduce considerably the benefit of parallelization. Unfortunately, the
more nodes we use, the more critical this situation will be. Thus, it is essential
to implement algorithms handling load-unbalance problems.


2  Load  balancing  algorithm 

Our dynamic load balancing algorithm is designed for a 2D regular grid divided
along the north/south and east/west directions (see fig. 1). This domain decomposition
leads to a set of rectangular sub-domains, each assigned to one PE.
Rectangular  shapes  have  the  significant  advantage  that  they  make  possible  fast 
regular  communications  between  adjacent  processors,  as  opposed,  for  instance, 
to  a  cyclic  partitioning. 

Our algorithm determines each sub-domain size according to the work load
of the PE. The sub-domain owned by an over-loaded PE will shrink while that
of an under-loaded PE will grow. A variation of size is obtained by moving sub-domain
boundaries in a coherent way: when a sub-domain attempts to grow, the
neighbors are forced to shrink, as illustrated in fig. 1. The problem is then to
compute,  in  a  minimal  amount  of  time,  the  boundary  motions  which  yield  a  fair 
work  distribution. 

The  measure  of  load  unbalance  we  consider  here  is  based  on  the  time  each 
PE  has  spent  during  the  last  computation  step.  This  measure,  which  has  the 
advantage  of  being  problem-independent,  is  appropriate  in  all  applications  where 
the  same  computation  is  iterated  many  times. 


Fig.  1.  Moving  east  boundary  of  node  E  will  act  on  node  F  west  boundary,  and  also 
on  nodes  H  and  I  boundaries. 

The load balancing problem can be addressed in two ways [KW] [Fos95]:
on  the  one  hand  we  can  use  a  global  policy  which  is  very  stable  and  accurate 
(because  it  accounts  for  every  processor  load),  but  may  not  be  scalable.  On  the 
other  hand,  we  can  consider  a  local  policy  which  may  have  a  slow  convergence 
and  lack  of  stability,  but  which  only  requires  local  communications. 


2.1  The  global  scheme 


We consider $n$ PEs involved in a parallel computation. At the end of step $t$, PE
number $i$ knows $T_{i,t}$, the time necessary to complete this step on its rectangular
sub-domain of size $S_{i,t}$. Using an all-to-all communication, the values $T_{i,t}$ and
$S_{i,t}$ are broadcast to all other PEs. From these values, each PE computes the
ideal time $Ta_t$

$$Ta_t = \frac{1}{n} \sum_{i=1}^{n} T_{i,t} \tag{1}$$

and $Tc_{i,t}$, the time used by PE $i$ for computing one single cell of domain $S_{i,t}$,

$$Tc_{i,t} = \frac{T_{i,t}}{S_{i,t}}. \tag{2}$$

Thus, optimizing domain distribution means to find $S_{i,t+1}$ minimizing the expression

$$\sum_{i=1,n} \left| Ta_t - Tc_{i,t} S_{i,t+1} \right|. \tag{3}$$

A solution to this optimization is obtained as follows: we first define $S_{i,t+1} = S_{i,t}$
and sort all PEs according to the value $|Ta_t - Tc_{i,t} S_{i,t+1}|$ (see fig. 2). In this
way, we find the worst sized domain (either oversized or undersized).
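
The quantities used by the global scheme can be computed as in the following sketch. It is only an illustration under stated assumptions: the function name `load_report` and the sample timings are hypothetical, the ideal time is taken as the mean of the measured step times, and only the first phase (computing the deviations and sorting the PEs, worst first) is shown, not the boundary motion itself.

```python
def load_report(times, sizes):
    """times[i]: time PE i spent on the last step; sizes[i]: its number of cells."""
    n = len(times)
    ideal = sum(times) / n                               # ideal time per step
    per_cell = [t / s for t, s in zip(times, sizes)]     # time per cell on each PE
    # Deviation from the ideal time if every PE keeps its current size.
    deviation = [abs(ideal - tc * s) for tc, s in zip(per_cell, sizes)]
    # PEs sorted from the worst sized domain to the best sized one.
    order = sorted(range(n), key=lambda i: deviation[i], reverse=True)
    return ideal, per_cell, deviation, order

times = [7.0, 2.0, 3.0, 4.0]       # hypothetical measurements (seconds)
sizes = [100, 100, 100, 100]       # equal sub-domains before balancing
ideal, per_cell, deviation, order = load_report(times, sizes)
print(ideal, order)
```

In this example the first PE is the slowest, so its sub-domain is the first candidate for shrinking.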

Then we try to adjust the size of $S_{I(1),t+1}$, where $I(1)$ denotes the first PE in the
sorted list, by virtually exchanging segments of
cells with the neighboring sub-domains so as to make $Tc_{i,t} S_{i,t+1} \approx Ta_t$. The new
sub-domains resulting from the motion of the boundary of rectangle $I(1)$ can be
computed by a recursive procedure. The same algorithm is run by each processor
since all of them have a complete list of the coordinates of all the rectangles. This
new domain decomposition is performed four times, for an east, west, north and
south boundary motion, and the best of these solutions is selected to give $S_{I(1),t+1}$.
However, this optimization may not be possible because: (i) the new partition
is more unbalanced than the previous one; (ii) a domain may shrink beyond an
acceptable size (e.g. in finite difference schemes, a specific number of neighbor
cells are required); in such a case the optimization is considered for $S_{I(2),t+1}$.
When the optimization is profitable, the whole process is repeated (sorting
and optimizing S) until a pre-assigned maximum number of steps is performed
or until the difference of time between the fastest and slowest PE is less than a
given threshold.

Once  the  new  partition  is  determined,  the  PEs  whose  domain  has  shrunk 
must  send  the  unused  data  to  the  PEs  whose  domain  has  increased.  Also,  the 
Fig. 2. PEs are sorted according to their size.


regular communications between adjacent PEs are affected by the new partition
and a new communication pattern has to be generated. Figure 3 shows two
configurations, before and after load balancing (upper-left and upper-right partitions,
respectively). The data motion problem is solved by superimposing the
shape  of  the  new  configuration,  including  ghost  cells  used  for  communication, 
on  top  of  the  old  one.  From  the  intersections,  one  can  determine  which  block 
should  be  sent  and  which  should  be  received. 
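
The superimposition of old and new configurations amounts to intersecting rectangles. The sketch below is a simplified illustration, not the authors' code: the names `intersect` and `blocks_to_receive` and the two-PE layout are hypothetical, rectangles are half-open cell ranges, and the ghost cells mentioned above are omitted for brevity.

```python
def intersect(a, b):
    """Rectangles are (x0, y0, x1, y1) with x1 and y1 exclusive; None if disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def blocks_to_receive(me, old, new):
    """old/new map a PE name to its rectangle before/after balancing.
    Returns {source PE: block} covering the new rectangle of `me`."""
    plan = {}
    for pe, rect in old.items():
        block = intersect(new[me], rect)
        if block is not None:
            plan[pe] = block       # pe == me means the block stays in place
    return plan

# PE A gives up one column of cells to PE B.
old = {"A": (0, 0, 4, 4), "B": (4, 0, 8, 4)}
new = {"A": (0, 0, 3, 4), "B": (3, 0, 8, 4)}
plan_b = blocks_to_receive("B", old, new)
print(plan_b)
```

The blocks a PE must send follow symmetrically, by intersecting its old rectangle with the new rectangles of the other PEs.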

This global dynamic load balancing algorithm is very efficient when used
with a limited number of PEs (up to 10-16), because its complexity grows
very  rapidly  as  the  number  of  PEs  increases.  It  is  then  well  adapted  to  a  coarse 
grain  parallelism. 

2.2  The  local  scheme 

For  a  fine  grain  parallelism,  a  local  policy  is  more  suitable.  The  main  problem 
is  for  a  PE  to  know  whether  it  can  move  its  boundary  since,  usually,  such  a 
modification  affects  more  than  the  immediate  neighbors.  In  our  local  strategy, 
each PE converses several times with its immediate neighbors, in order to let
information  propagate. 

First, each PE compares its last CPU time measurement with those of its
adjacent neighbors. The locally slower PEs try to shrink their domains, while
the  faster  ones  try  to  obtain  a  larger  domain.  This  decision  is  taken  jointly  by 
propagating  a  request  among  the  PEs  concerned  by  the  change  (see  fig.  4).  The 
algorithm  proceeds  as  follows:  let  us  assume  that  n  PEs  have  a  north/south 
boundary  along  the  same  vertical  line.  In  the  worst  case,  a  full  column  of  PEs 
could be concerned and $n$ would be of the order $\sqrt{p}$, where $p$ is the total number
Fig. 3. Use of masks to determine data migration between two configurations (the
upper-left  to  upper-right  configuration).  We  apply  to  PE  E  the  new  partition  shape 
(including  ghost  border).  One  block  need  not  migrate,  whereas  five  blocks  must  be 
imported  from  neighbors. 


Fig.  4.  Taking  a  coherent  decision  about  moving  a  boundary  is  difficult  because  this 
affects non-local neighbors; if PE D wants to move its eastern boundary, it will move
PE K's western one. The decision of how much each boundary will move is made by
propagating  request  messages  across  the  affected  region. 
of processors. When PE number $i$ wants to move its east boundary, it exchanges
with all its adjacent east neighbors a message of size $2n - 1$ with, in position
$i$, a request containing the amplitude of the desired motion. The message size
$2n - 1$ ensures that all PEs along the same vertical boundary can insert in
the message the requested motion. Once all slots of the message are filled in (i.e.
after $2n - 1$ iterations), the processors may take a common decision, which is, in
the current implementation, to move the boundary by the average displacement.
The same procedure can then be repeated across the east/west boundaries.
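
The propagation of requests along a boundary can be simulated with a few lines of code. The sketch below is a deliberately simplified model and an assumption on our part: the PEs form a simple chain, each keeps a table with one slot per PE (rather than the message of size 2n - 1 described above), and the function name `agree_on_displacement` is hypothetical. It only illustrates that repeated neighbor-to-neighbor exchanges give every PE the same information, from which the same average displacement is computed everywhere.

```python
def agree_on_displacement(requests):
    """requests[i]: signed boundary displacement (in cells) wanted by PE i.
    Each round, a PE merges the tables of its two immediate neighbours."""
    n = len(requests)
    tables = [{i: requests[i]} for i in range(n)]
    rounds = 0
    while any(len(t) < n for t in tables):
        snapshot = [dict(t) for t in tables]     # messages sent this round
        for i in range(n):
            for j in (i - 1, i + 1):
                if 0 <= j < n:
                    tables[i].update(snapshot[j])
        rounds += 1
    # Every PE now holds the same table and computes the same average.
    decisions = [sum(t.values()) / n for t in tables]
    return decisions, rounds

decisions, rounds = agree_on_displacement([3, -1, 2, 0])
print(decisions, rounds)
```

In this simplified chain model, n - 1 rounds are enough for the requests of the two end PEs to reach each other.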

When each PE knows in which way it will have to move its boundaries, it
must determine whether its neighbor list has changed as a result of the modified
partition (see fig. 5). To obtain this information, it must know the new coordinates
of the domains owned by the surrounding PEs. This can be obtained by
propagating each PE's coordinates a number max(m,n) of times along the connected
sub-domains, where m is the number of PEs sharing the same horizontal
boundary. Then, it is possible to reorganize the data using the same technique
as discussed in section 2.1, but with a local computation.
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Fig.  5.  Due  to  the  east  boundary  modification,  PE  E  will  have  to  add  PE  C  in  its 
neighbors  list. 
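
Once the new coordinates of the surrounding domains are known, deciding whether a PE belongs in the neighbor list is a purely local geometric test. The sketch below assumes that each sub-domain is an axis-aligned rectangle with integer, half-open extents [x0, x1) x [y0, y1), and that two domains are neighbors only if they share a boundary segment of non-zero length (corner contact is assumed not to count). The domain type and the is_neighbor function are hypothetical names, not taken from the authors' implementation.

```c
#include <assert.h>

/* Rectangular sub-domain with half-open extents (assumed representation). */
typedef struct { int x0, x1, y0, y1; } domain;

/* Length of the overlap between the intervals [a0, a1) and [b0, b1). */
static int overlap(int a0, int a1, int b0, int b1)
{
    int lo = a0 > b0 ? a0 : b0;
    int hi = a1 < b1 ? a1 : b1;
    return hi > lo ? hi - lo : 0;
}

/* Returns 1 if a and b share a boundary segment of non-zero length. */
int is_neighbor(const domain *a, const domain *b)
{
    assert(a->x0 < a->x1 && a->y0 < a->y1);
    assert(b->x0 < b->x1 && b->y0 < b->y1);

    /* East/west contact: touching x coordinates, overlapping y ranges. */
    if ((a->x1 == b->x0 || b->x1 == a->x0) &&
        overlap(a->y0, a->y1, b->y0, b->y1) > 0)
        return 1;

    /* North/south contact: touching y coordinates, overlapping x ranges. */
    if ((a->y1 == b->y0 || b->y1 == a->y0) &&
        overlap(a->x0, a->x1, b->x0, b->x1) > 0)
        return 1;

    return 0;
}
```

In a situation like the one in fig. 5, a domain that lies to the north-east of PE E fails this test before the move and passes it once the east boundary of E has moved far enough for the x ranges to overlap.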


3  Applications 

3.1  Modeling  pollution  transport 

We have parallelized an atmospheric pollution transport application involving 35
chemical species [MOC+ed], distributed on a 3D regular grid and implementing
more than one hundred chemical equations depending on various atmospheric
parameters such as wind, temperature and humidity. For design reasons, the 3D
spatial domain is divided into columns along the horizontal plane. After each
useful calculation step, a global communication step is done to determine whether
load balancing is necessary. This kind of compute-intensive problem on a small
domain (the grid dimension is about 50x40 columns) is well suited to testing our global
load balancing approach.

We have ported the application to our department workstations, which are
Sun Sparc 4, Sparc 5 and Ultra 1 machines. In this case load balancing is critical
because we have no control over the availability of each CPU. Early in the morning
(when little activity is recorded but different architectures are used) our problem
can be solved, without load balancing, in 1h15mn. This can easily turn into 2 hours
in the middle of the day. With our global load balancing algorithm turned on, a run
takes only about 42mn, even during the peak activity time. Figure 6 shows the
time per computation step as a function of the iteration. Without load balancing,
we can see a significant variation of performance among the machines. With load
balancing, the time spent by each PE is much more homogeneous.


[Figure 6: two panels titled "Global scheme, load balancing off" and "Global scheme, load balancing on"; horizontal axis: steps.]


Fig. 6. CPU time per iteration as a function of the iteration, for 16 heterogeneous
workstations running the pollution transport code. On the left, the global load balancing
algorithm is off and, on the right, it is on.


3.2  Wave  propagation 

A second application we have parallelized on a cluster of workstations is a wave
propagation model for urban environments. This application is based on the
lattice Boltzmann method [CLW97] and uses a numerical scheme close to the
so-called TLM (Transmission Line Matrix) method. Wave propagation is an
important problem when designing a mobile communication network. Figure 7
illustrates the wave intensity pattern predicted by our application in the case
where an antenna is located in the middle of an urban area.

This wave propagation model requires 15 floating point operations per grid
site and four communication steps (with the east, north, west and south
neighbors). Since the computation step is rather light, synchronization of the PEs is
frequent and this problem may be difficult to parallelize in an efficient way on a
workstation cluster with low communication bandwidth and high latency. Under
such conditions it seems very difficult to obtain good load balancing
without a significant overhead.

This application corresponds to a real-life problem. The simplicity of the
numerical scheme makes it a good candidate for a benchmark of
a fine-grained parallel application. Table 1 shows the performance obtained for a
sequential implementation of this code.


CPU          frequency   cache     200x200x800     800x800x1600
             (MHz)       memory    grid (Mflops)   grid (Mflops)

DEC alpha    150         96KB       6.0            n/a
Sun Sparc4   110         256KB      7.2             5.4
Intel Ppro   200         512KB     17.1            16.2
Sun Ultra1   166         512KB     18.5            11.8
IBM RS6000    66         64KB      24.9            10.7
Intel PII    266         1MB       25.9
SG RS10000   175         1MB       33.1            22.4

Table 1. Some benchmark results with the best optimized sequential version of our
TLM wave propagation model. Note that all these figures are obtained for simulations
on square domains, without buildings. Two tests were done: the first one corresponds
to running the application on a grid of size 200x200 for 800 iterations; the second
test is for an 800x800 grid during 1600 iterations. Because cache misses increase,
performance decreases as the domain size increases. The RS10000 results may be pessimistic
because the machine may have been overloaded when the benchmarks were run.


The parallelized version of the TLM simulations has been run on a 32-PC
cluster running Linux and PVM [BDG+91], each node being a Pentium II
266 MHz with 64MB of memory. All PCs are interconnected using two 18-entry
switches, with Fast Ethernet links. A Sun Sparc 5 with NFS is used as file server.
The advantages of such a solution are a very low price (about 60'000 dollars),
fairly good performance, and full control over each node, since the machine can be
dedicated to parallel applications only.

Load imbalances are due to the fact that, during the first iterations, only
the processor containing the antenna site must perform useful computations.
As the wave front propagates, all processors become active. The presence of
buildings may be another cause of uneven load distribution among
the processors, because no computation is needed on building sites.

Due to some specific optimization techniques that are only valid in the sequential
version of the code (memory moves replaced by alias pointers), the parallel
version is slower by a factor of 2, even when run on a single processor: on a 200 x 200
domain, 800 iterations take 16.2 seconds to complete for the sequential code and
32.8 seconds for the parallel one. Therefore, we do not expect an efficiency larger than
50% for the parallel implementation. However, as the problem size increases
(e.g. 1600 x 1600 during 4800 iterations), the code no longer runs on a single PE
without swapping. This makes the parallel implementation very effective.


Fig. 7. Processing wave propagation in an urban environment with 16 workstations, using
our load balancing scheme. The rectangles show the block of data assigned to each
processor.


For this problem size, the parallel code runs in 1026 sec, without load
balancing and using 16 free PEs. In fig. 8 we show the evolution of the time needed
to perform a chunk of computation corresponding to 16 propagation steps. After
an initial stage (where some PEs have not yet been reached by the wave), the
computation time becomes uniform and each chunk of 16 consecutive iterations
takes about 4.5 seconds.

In order to test our local load balancing algorithm, we launch on one node a
program aimed at perturbing the computation. This program, given in the
appendix, uses a lot of CPU. As a consequence of this extra load, the TLM execution
time increases to 4976 sec and the perturbed node performs the 16 steps in about
25 sec, thus slowing down the whole system.

Under the same conditions, the local load balancing algorithm modifies the
domain partition and the total CPU time comes down to 2243 sec. This effect is
illustrated in fig. 8, where the load imbalance is erased. However, this is still slower than the version without
perturbation. This difference is due to the poor stability of the local load balancing, which
will be improved in further work.

Figure 9 shows the speedup obtained with and without load balancing.
Comparison is made with the most optimized sequential version [Lut98].
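
The timings quoted in this section can be turned into the usual speedup and efficiency figures. The helpers below are only an illustrative sketch (the function names speedup and efficiency are ours, not the authors'); they apply the standard definitions S = T_seq / T_par and E = S / p to numbers taken from the text.

```c
#include <assert.h>

/* Speedup: sequential time divided by parallel time (same units). */
double speedup(double t_seq, double t_par)
{
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Efficiency: speedup divided by the number of PEs used. */
double efficiency(double t_seq, double t_par, int p)
{
    assert(p >= 1);
    return speedup(t_seq, t_par) / (double) p;
}
```

For instance, efficiency(16.2, 32.8, 1) is about 0.49, which is the 50% upper bound mentioned above. If the 118-minute estimate of fig. 9 is taken as the sequential reference for the large problem, speedup(118.0 * 60.0, 1026.0) gives roughly 6.9 on 16 PEs, i.e. an efficiency of about 0.43. Under perturbation, going from 4976 sec to 2243 sec corresponds to a gain of about 2.2 from local load balancing.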

4  Conclusion 

From these results we conclude that it is entirely possible to use clusters of
workstations (or PCs) instead of heavy, expensive and less convenient mainframes.
With our approach we obtained satisfactory results in terms of speedup and
load balancing, at very low cost.

Some parts of our load balancing algorithms are still heuristically determined,
and the worst-case complexity of the overhead, the convergence time and the
stability of the partitioning procedure must still be investigated. However, the results
discussed here show clearly the benefit of our load balancing strategies when the
PEs are heterogeneous or loaded in an unpredictable way by other jobs.


Fig. 8. Evolution of the CPU time needed to perform 16 iterations of the TLM problem,
with 16 Pentium II-266. On the left, all PEs are dedicated to the computation and a
balanced situation is observed. The middle plot corresponds to a situation where 2
nodes are disturbed by a CPU-intensive program. A clear load imbalance
is observed, unless our local algorithm is turned on, as shown on the right.


[Figure 9: plot titled "Speedup running TLM".]


Fig. 9. Speedup computed for problems of both sizes 1600 x 1600 and 800 x 800 run
for 4800 iterations. The best sequential time, estimated at 118 minutes, is interpolated
from the timing of the 800 problem size (since the large grid size would cause swapping
on one single PE). The performance of the load-balancing algorithm might be improved by
decreasing the instability seen in fig. 8.


Appendix: Slowing down a node

This program is used to slow down a Unix system: it forks and produces
n copies of itself. Each copy allocates a big block of memory and accesses it
randomly. When the parent process is killed with the kill -USR1 pid command,
all child processes are killed too. Note that the CPU overload created is not
constant because swapping events may randomly appear.


/*
 * little program used to slow down a UNIX station
 */

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>     /* fork, getpid */

/* memory used per process (1 MB) */
#define BLOCK 0x100000

/* some global data */
pid_t *pids;
int nb_childs;

/* to kill all processes (installed as the SIGUSR1 handler) */
void kill_childs(int sig)
{
    int i;

    (void) sig;
    for (i = 0; i < nb_childs; i++) kill(pids[i], SIGUSR2);
    printf("\nOk, I'm dead.\n");
    exit(0);
}

int main(int argc, char **argv)
{
    double *datas;
    int i;
    int nb_el = BLOCK / sizeof(double);
    pid_t pid = 1;

    signal(SIGUSR1, kill_childs);   /* to kill all processes */

    if (argc != 2) { printf("Usage : %s nb_processes\n", argv[0]); exit(0); }
    nb_childs = atoi(argv[1]) - 1;  /* get nb processes from command line */

    printf("Starting with %d processes\n", nb_childs + 1);
    printf(" to kill me use the \"kill -USR1 %d\" command\n", (int) getpid());

    pids = (pid_t *) malloc(nb_childs * sizeof(pid_t));  /* child ids reminder */
    datas = (double *) calloc(nb_el, sizeof(double));    /* memory allocation */

    for (i = 0; i < nb_childs; i++)   /* creating child */
        if (pid) { pid = fork(); pids[i] = pid; }   /* only the parent keeps forking */

    srand(getpid());   /* pseudo-random-number generator init */
    while (1) { datas[rand() % nb_el]++; }   /* random accesses in memory */
}


